Re: Dealing with Autodecode
On Wednesday, 1 June 2016 at 01:36:43 UTC, Adam D. Ruppe wrote:
> D USERS **WANT** BREAKING CHANGES THAT INCREASE OVERALL CODE QUALITY WITH A SIMPLE MIGRATION PATH

This. I only recently started full-scale use of D, but I lurked here for years. D has a few quirks here and there, but overall it's a fantastic language. However, the biggest off-putting factor for me is the attitude of the leadership towards fixing the issues and completing the language.

The idea of autodecoding comes very naturally to someone who only recently discovered Unicode. Whoa, instead of code pages we now have "Unicode code points". Great. Only much later does the person realize that working with code points isn't always correct. So I don't blame anyone for designing/implementing autodecoding years ago. But not acknowledging that autodecoding is seriously wrong *now* looks like complete brain damage. The entire community seems united in the view that autodecoding is both slow and usually wrong. The users are begging for this breaking change, and there are a number of approaches to handling the deprecation. Even code that for some reason really needs to work with code points would benefit from explicitly stating that it needs code points. But no, we must endure this madness forever.

I realize that the priorities of a language user might differ from those of the language leadership. With autodecoding fixed (removed), the user gets a cleaner language. Their program works faster and is easier to reason about. The user's brain cycles are not wasted on useless crap like working around autodecoding. On the other hand, the language/stdlib designers now have to admit their initial design was sub-optimal. Their books and articles become obsolete. And they will be the ones who receive complaints from the inevitable few upset with the change.

However, keeping the current situation means, for me personally:

1. Not switching to D wholesale, but just toying with it.

2. Even when using D for work, not wanting to talk about it to others. I was seriously thinking about starting a D-learning seminar at work, and I still might, but the thought that autodecoding is going to stay is cooling my enthusiasm.

I just did a numerical app in D, where it shines, I think. However, much of my work code deals with huge texts. I don't want to fight with autodecoding at every step. I'd like arrays of chars to be arrays of chars, without any magic crap auto-inserted behind my back. I don't want to become an expert in avoiding language pitfalls (the reason I abandoned C++ years ago). I also don't want to re-implement the staple string processing routines (though I might, if at least the language constructs worked without autodecoding, which seems not to be the case here).

Think about it: 99% of code working with code points is _broken_ anyway (in the sense that the usual assumption is that a code point represents a character, while in fact it does not). When working with code units, the developer notices the problem right away. When working with code points, the problem is not apparent until years later (essentially what happened to D itself).

Feel free to ignore my non-D-core-dev comment, even though I suspect many D users agree with me. An even larger number of potential D users does not want autodecoding either.

Thanks,
Kirill
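The "a code point does not represent a character" point is easy to demonstrate with standard Phobos alone. A minimal sketch (walkLength from std.range, byGrapheme from std.uni) showing that the three levels of abstraction give three different answers for the same one-character string:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "é" encoded as 'e' + U+0301 COMBINING ACUTE ACCENT:
    // one visible character, two code points, three UTF-8 code units
    string s = "e\u0301";
    assert(s.length == 3);                 // code units (no decoding)
    assert(s.walkLength == 2);             // code points (autodecoding)
    assert(s.byGrapheme.walkLength == 1);  // full characters (graphemes)
}
```

Any code that takes the middle number to be "the number of characters" is exactly the silently broken case described above.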
Re: D Embedded Database v0.1 Released
On Tuesday, 31 May 2016 at 22:08:00 UTC, Stefan Koch wrote:
> Nice effort. How would you like collaboration with the SQLite-D project?

Thanks. Correct me if I'm wrong, but SQLite-D is a compile-time SQLite3 file reader. If so, I predict not many common parts. Maybe one would be a data deserialization component; however, I didn't check how it's done in SQLite-D. SQLite-D has similar goals, albeit with a file format compatible with SQLite. When I was selecting a possible file format I was thinking about the SQLite one. I am actually a fan of the SQLite project. However, there are some shortcomings in the current SQLite3 format:

- SQLite3 is not really one-file storage (i.e. there is a journal file)
- it gets fragmented very quickly (check out the design goals for SQLite4)
- it's overcomplicated and non-deterministic with respect to real-time software
- it has unnecessary overhead, because every column is actually a variant type

Add to this the main goal of replacing SQL with D ranges+algorithms. As a result, it turned out it would be great to have an alternate format.

BTW, would someone be so kind as to post the above paragraph on Reddit under a comment about the SQLite db? I'm not registered there.

Piotrek
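To illustrate the "D ranges+algorithms instead of SQL" idea in the abstract - this is a generic sketch with a hypothetical row type, not the project's actual API:

```d
import std.algorithm.iteration : filter, map;
import std.array : array;

// Hypothetical table row type, purely for illustration
struct Person { string name; int age; }

void main()
{
    auto people = [Person("Ann", 34), Person("Bob", 19), Person("Eve", 42)];
    // Roughly: SELECT name FROM people WHERE age > 30
    auto names = people.filter!(p => p.age > 30)
                       .map!(p => p.name)
                       .array;
    assert(names == ["Ann", "Eve"]);
}
```

The appeal is that the "query" is ordinary, composable, statically-typed D code rather than an embedded string in another language.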
[Issue 15885] float serialized to JSON loses precision
https://issues.dlang.org/show_bug.cgi?id=15885

--- Comment #4 from github-bugzi...@puremagic.com ---
Commits pushed to master at https://github.com/dlang/phobos

https://github.com/dlang/phobos/commit/7a486d9d038448595c74aa4ef4bd7d9e952a4b64
Fix issue 15885 - numeric values serialized to JSON lose precision.

https://github.com/dlang/phobos/commit/f4ad734aad6e3b2dd4881508d2b15eebb732a26c
Merge pull request #4345 from tsbockman/issue-15885-tsb
Fix issue 15885 - float serialized to JSON loses precision

--
[Issue 15885] float serialized to JSON loses precision
https://issues.dlang.org/show_bug.cgi?id=15885

github-bugzi...@puremagic.com changed:

           What       |Removed    |Added
------------------------------------------
           Status     |ASSIGNED   |RESOLVED
           Resolution |---        |FIXED

--
Re: The Case Against Autodecode
On 5/31/2016 4:00 PM, ag0aep6g wrote:
> Wikipedia says [1] that UCS-2 is essentially UTF-16 without surrogate pairs. I suppose you mean UTF-32/UCS-4.
>
> [1] https://en.wikipedia.org/wiki/UTF-16

Thanks for the correction.
Re: The Case Against Autodecode
On Wednesday, 1 June 2016 at 02:17:21 UTC, Jonathan M Davis wrote:
> ...

This thread is going in circles; the against crowd has stated each of its arguments very clearly, at least five times, in different ways. The cost/benefit problems with autodecoding are as clear as day. If the evidence already presented in this thread (and in the many others) isn't enough to convince people of that, then I don't think anything else said will have an impact. I don't want to sound like someone telling people not to discuss this anymore, but honestly, what is continuing this thread going to accomplish?
Re: Dealing with Autodecode
On Wednesday, 1 June 2016 at 02:58:36 UTC, Brad Roberts wrote:
> ...the rate of bug fixing which exceeds the rate of fix pulling.

Speaking of which:
https://github.com/dlang/phobos/pull/4345
https://github.com/dlang/phobos/pull/3973
Re: Reddit announcements
On Tuesday, 31 May 2016 at 18:57:29 UTC, o-genki-desu-ka wrote:
> Many nice announcements here last week. I put some on reddit.

Thank you for doing this! I agree with previous posts, though, that this is too many at once. Also, I think posting a link directly to each project instead of to the forum post would have been better.
[Issue 16107] [ICE] - Internal error: backend/cgcod.c 2297
https://issues.dlang.org/show_bug.cgi?id=16107

--- Comment #1 from b2.t...@gmx.com ---
The definition of Foo can be reduced to

class Foo
{
    alias TreeItemType = typeof(this);
    TreeItemSiblings!TreeItemType _siblings; // remove this decl
    TreeItemChildren!TreeItemType _children; // or this one : OK
}

The content was initially a mixin template, which explains why it was incoherent... anyway, still the ICE.

--
Re: Button: A fast, correct, and elegantly simple build system.
On Tuesday, 31 May 2016 at 14:28:02 UTC, Dicebot wrote:
> Can it be built from just a plain dmd/phobos install? One of the major concerns behind the discussion that resulted in Atila's reggae effort is that propagating additional third-party dependencies is very damaging for build systems. Right now Button seems to fail rather hard on this front (i.e. Lua for the build description + an uncertain number of build dependencies for Button itself).

Building it only requires dmd+phobos+dub. Why is having dependencies so damaging for build systems? Does it really matter with a package manager like Dub? If there is another thread that answers these questions, please point me to it.

The two dependencies Button itself has could easily be moved into the same project. I kept them separate because they can be useful for others. These are the command-line parser and IO stream libraries. As for the dependency on Lua, it is statically linked into a separate executable (called "button-lua") and building it is dead simple (just run make). Using the Lua build description generator is actually optional; it's just that writing build descriptions in JSON would be horribly tedious.
[Issue 16107] New: [ICE] - Internal error: backend/cgcod.c 2297
https://issues.dlang.org/show_bug.cgi?id=16107

          Issue ID: 16107
           Summary: [ICE] - Internal error: backend/cgcod.c 2297
           Product: D
           Version: D2
          Hardware: x86_64
                OS: Linux
            Status: NEW
          Severity: critical
          Priority: P1
         Component: dmd
          Assignee: nob...@puremagic.com
          Reporter: b2.t...@gmx.com

The following code, compiled with DMD 2.071.1-b2, crashes the compiler:

===
import std.stdio, std.traits;

struct TreeItemChildren(T){}
struct TreeItemSiblings(T){}

class Foo
{
    enum isStruct = is(typeof(this) == struct);
    static if (isStruct)
        alias TreeItemType = typeof(this)*;
    else
        alias TreeItemType = typeof(this);
    TreeItemSiblings!TreeItemType _siblings; // remove this decl
    TreeItemChildren!TreeItemType _children; // or this one : OK
}

template Bug(T)
{
    bool check()
    {
        bool result;
        import std.meta : aliasSeqOf;
        import std.range : iota;
        foreach (i; aliasSeqOf!(iota(0, T.tupleof.length)))
        {
            alias MT = typeof(T.tupleof[i]);
            static if (is(MT == struct))
                result |= Bug!MT; // result = result | ... : OK
            if (result)
                break; // remove this line : OK
        }
        return result;
    }
    enum Bug = check();
}

void main()
{
    assert(!Bug!Foo);
}
===

produces

> Internal error: backend/cgcod.c 2297

The comments in the code indicate that the bug doesn't happen when the relevant line is commented out.

--
Re: Button: A fast, correct, and elegantly simple build system.
On Tuesday, 31 May 2016 at 10:15:14 UTC, Atila Neves wrote:
> On Monday, 30 May 2016 at 19:16:50 UTC, Jason White wrote:
>> I am pleased to finally announce the build system I've been slowly working on for over a year in my spare time: snip In fact, there is some experimental support for automatic conversion of Makefiles to Button's build description format using a fork of GNU Make itself: https://github.com/jasonwhite/button-make
>
> I'm going to take a look at that!

I think the Makefile converter is probably the coolest thing about this build system. I don't know of any other build system that has done this. The only problem is that it doesn't do well with Makefiles that invoke make recursively. I tried compiling Git using it, but Git does some funky stuff with recursive make, like grepping the output of the sub-make.

>> - Can automatically build when an input file is modified (using inotify).
>
> Nope, I never found that interesting. Possibly because I keep saving after every edit in OCD style and I really don't want things running automatically.

I constantly save like a madman too. If an incremental build is sufficiently fast, it doesn't really matter. You can also specify a delay so it accumulates changes and then runs a build after X milliseconds.

>> - Recursive: It can build the build description as part of the build.
>
> I'm not sure what that means. reggae copies CMake here and runs itself when the build description changes, if that's what you mean.

It means that Button can run Button as a build task (and it does it correctly). A child Button process reports its dependencies to the parent Button process via a pipe. This is the same mechanism that detects dependencies for ordinary tasks. Thus, there is no danger of doing incorrect incremental builds when recursively running Button like there is with Make.

>> - Lua is the primary build description language.
>
> In reggae you can pick from D, Python, Ruby, Javascript and Lua.

That's pretty cool. It is possible for Button to do the same, but I don't really want to support that many languages. In fact, the Make and Lua build descriptions both work the same exact way: they output a JSON build description for Button to use. So long as someone can write a program to do this, they can write their build description in it.
Re: Button: A fast, correct, and elegantly simple build system.
On Tuesday, 31 May 2016 at 03:40:32 UTC, rikki cattermole wrote:
> Are you on Freenode (no nick to name right now)? I would like to talk to you about a few ideas relating to lua and D.

No, I'm not on IRC. I'll see if I can find the time to hop on this weekend.
Re: Dealing with Autodecode
On 5/31/2016 7:40 PM, Walter Bright via Digitalmars-d wrote:
> On 5/31/2016 7:28 PM, Jonathan M Davis via Digitalmars-d wrote:
>> The other critical thing is to make sure that Phobos in general works with byDChar, byCodeUnit, etc. For instance, pretty much as soon as I started trying to use byCodeUnit instead of naked strings, I ran into this: https://issues.dlang.org/show_bug.cgi?id=15800
>
> That was posted 3 months ago. No PR to fix it (though it likely is an easy fix). If we can't get these things fixed in Phobos, how can we tell everyone else to fix their code?

I hope that wasn't a serious question. The answer is trivial: the rate of incoming bug reports exceeds the rate of bug fixing, which exceeds the rate of fix pulling. Has since about the dawn of time.
Re: Dealing with Autodecode
On 5/31/2016 7:28 PM, Jonathan M Davis via Digitalmars-d wrote:
> The other critical thing is to make sure that Phobos in general works with byDChar, byCodeUnit, etc. For instance, pretty much as soon as I started trying to use byCodeUnit instead of naked strings, I ran into this: https://issues.dlang.org/show_bug.cgi?id=15800

That was posted 3 months ago. No PR to fix it (though it likely is an easy fix). If we can't get these things fixed in Phobos, how can we tell everyone else to fix their code?
Re: Dealing with Autodecode
On 5/31/2016 6:36 PM, Adam D. Ruppe wrote:
> Our preliminary investigation found about 130 places in Phobos that need to be changed.

That's not hard to fix! PRs please!
Re: Dealing with Autodecode
On 05/31/2016 09:36 PM, Adam D. Ruppe wrote:
> version(string_migration)
>     deprecated void popFront(T)(ref T t) if(isSomeString!T)
>     {
>         static assert(0, "this is crap, fix your code.");
>     }
> else
>     deprecated("use -version=string_migration to fix your buggy code, would you like to know more?")
>     /* existing popFront here */

I vote we use Adam's exact verbiage, too! :)

> D USERS **WANT** BREAKING CHANGES THAT INCREASE OVERALL CODE QUALITY WITH A SIMPLE MIGRATION PATH

Yes. This. If I wanted an endless bucket of baggage, I'd have stuck with C++.

> 3) A wee bit longer, we exterminate all this autodecoding crap and enjoy Phobos being a smaller, more efficient library.

Yay! Profit!
Re: Dealing with Autodecode
On Tuesday, May 31, 2016 17:46:04 Walter Bright via Digitalmars-d wrote: > It is not practical to just delete or deprecate autodecode - it is too > embedded into things. What we can do, however, is stop using it ourselves > and stop relying on it in the documentation, much like [] is eschewed in > favor of std::vector in C++. > > The way to deal with it is to replace reliance on autodecode with .byDchar > (.byDchar has a bonus of not throwing an exception on invalid UTF, but using > the replacement dchar instead.) > > To that end, and this will be an incremental process: > > 1. Temporarily break autodecode such that using it will cause a compile > error. Then, see what breaks in Phobos and fix those to use .byDchar > > 2. Change examples in the documentation and the Phobos examples to use > .byDchar > > 3. Best practices should use .byDchar, .byWchar, .byChar, .byCodeUnit when > dealing with ranges/arrays of characters to make it clear what is happening. The other critical thing is to make sure that Phobos in general works with byDChar, byCodeUnit, etc. For instance, pretty much as soon as I started trying to use byCodeUnit instead of naked strings, I ran into this: https://issues.dlang.org/show_bug.cgi?id=15800 But once Phobos no longer relies on autodecoding except maybe in places where we can't actually excise it completely without breaking code (and hopefully there are none of those), then we can look at how feasible the full removal of auto-decoding really is. IMHO, leaving it in is a _huge_ piece of technical debt that we don't want and probably can't afford, so I really don't think that we should just assume that we can't remove it due to the breakage that it would cause. But we definitely have work to do before we can have Phobos in a state where it's reasonable to even make an attempt. byCodeUnit and friends were a good start, but we need to make it so that they're treated as first-class citizens, and they're not right now. - Jonathan M Davis
Re: Reddit announcements
On 5/31/16 2:57 PM, o-genki-desu-ka wrote: Many nice announcements here last week. I put some on reddit. https://www.reddit.com/r/programming/comments/4lwufi/d_embedded_database_v01_released/ https://www.reddit.com/r/programming/comments/4lwubv/c_to_d_converter_based_on_clang/ https://www.reddit.com/r/programming/comments/4lwu5p/coedit_2_ide_update_6_released/ https://www.reddit.com/r/programming/comments/4lwtxw/compiletime_sqlite_for_d_beta_release/ https://www.reddit.com/r/programming/comments/4lwtr0/button_a_fast_correct_and_elegantly_simple_build/ https://www.reddit.com/r/programming/comments/4lwtn9/first_release_of_powernex_an_os_kernel_written_in/ Very nice. Response has been positive. Thank you very much! -- Andrei
Re: The Case Against Autodecode
On Tuesday, May 31, 2016 23:36:20 Marco Leise via Digitalmars-d wrote:
> On Tue, 31 May 2016 16:56:43 -0400, Andrei Alexandrescu wrote:
> > On 05/31/2016 03:44 PM, Jonathan M Davis via Digitalmars-d wrote:
> > > In the vast majority of cases what folks care about is full character
> >
> > How are you so sure? -- Andrei
>
> Because a full character is the typical unit of a written language. It's what we visualize in our heads when we think about finding a substring or counting characters. A special case of this is the reduction to ASCII, where we can use code units in place of grapheme clusters.

Exactly. How many folks here have written code where the correct thing to do is to search on code points? Under what circumstances is that even useful? Code points are a mid-level abstraction between UTF-8/16 and graphemes that are not particularly useful on their own. Yes, by using code points, we eliminate the differences between the encodings, but how much code even operates on multiple string types? Having all of your strings use the same encoding fixes the consistency problem just as well as autodecoding to dchar everywhere does - and without the efficiency hit.

Typically, folks operate on string or char[] unless they're talking to the Windows API, in which case they need wchar[]. Our general recommendation is that D code operate on UTF-8 except when it needs to operate on a different encoding because of other stuff it has to interact with (like the Win32 API). Ideally, it converts those strings to UTF-8 once they get into the D code and operates on them as UTF-8, and anything that has to be output in a different encoding is operated on as UTF-8 until it needs to be output, at which point it's converted to UTF-16 or whatever the target encoding is. Not much of anyone is recommending that you use dchar[] everywhere, but that's essentially what the range API is trying to force.
I think it's very safe to say that the vast majority of string processing is looking to operate either on strings as a whole or on individual, full characters within a string. Code points are neither. While code may play tricks with Unicode to be efficient (e.g. operating at the code unit level where it can, rather than decoding to either code points or graphemes), or it might make assumptions about its data being ASCII-only, aside from explicit Unicode processing code, I have _never_ seen code that was actually looking to logically operate on only pieces of characters. While it may operate on code units for efficiency, it's always looking to logically operate on the string as a unit or on whole characters.

Anyone looking to operate on code points is going to need to take into account the fact that they're not full characters, just like anyone who operates on code units needs to take into account the fact that they're not whole characters. Operating on code points as if they were characters - which is exactly what D currently does with ranges - is just plain wrong. We need to support operating at the code point level for those rare cases where it's actually useful, but autodecoding makes no sense. It incurs a performance penalty without actually giving correct results, except in those rare cases where you want code points instead of full characters. And only Unicode experts are ever going to want that.

The average programmer who is not super Unicode-savvy doesn't even know what code points are. They're clearly going to be looking to operate on strings as sequences of characters, not sequences of code points. I don't see how anyone could expect otherwise. Code points are a mid-level Unicode abstraction that only those who are Unicode-savvy are going to know or care about, let alone want to operate on.

- Jonathan M Davis
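The "operating on code points as if they were characters is just plain wrong" failure mode can be made concrete with a minimal sketch using only standard Phobos: a code-point-level search reports a match that no whole character justifies.

```d
import std.algorithm.searching : canFind;

void main()
{
    // "é" as base letter 'e' + U+0301 combining accent:
    // no standalone "e" character is visible in this string
    string s = "e\u0301";
    // A code-point-level search (which is what autodecoding ranges do)
    // still reports a hit, because it sees the bare base code point
    assert(s.canFind('e'));
}
```

The same search at the grapheme level would correctly report no match, because the only character present is "é".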
Re: The Case Against Autodecode
On Tuesday, May 31, 2016 20:38:14 Nick Sabalausky via Digitalmars-d wrote:
> On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote:
> > On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote:
> >> On 05/31/2016 03:32 PM, H. S. Teoh via Digitalmars-d wrote:
> >>> Let's put the question this way. Given the following string, what do
> >>> *you* think walkLength should return?
> >>>
> >>> şŭt̥ḛ́k̠
> >>
> >> The number of code units in the string. That's the contract promised and
> >> honored by Phobos. -- Andrei
> >
> > Code points I mean. -- Andrei
>
> Yes, we know it's the contract. ***That's the problem.*** As everybody
> is saying, it *SHOULDN'T* be the contract.
>
> Why shouldn't it be the contract? Because it's proven itself, both
> logically (as presented by pretty much everybody other than you in both
> this and other threads) and empirically (in phobos, warp, and other user
> code) to be both the least useful and most PITA option.

Exactly. Operating at the code point level rarely makes sense. What sorts of algorithms purposefully do that in a typical program? Unless you're doing very specific Unicode stuff or somehow know that your strings don't contain any graphemes that are made up of multiple code points, operating at the code point level is just bug-prone, and unless you're using dchar[] everywhere, it's slow to boot, because your strings have to be decoded whether the algorithm needs it or not.

I think it's very safe to say that the vast majority of string algorithms are either able to operate at the code unit level without decoding (though possibly encoding another string to match - e.g. with a string comparison or search), or they have to operate at the grapheme level in order to deal with full characters. A code point is borderline useless on its own. It's just a step above the different UTF encodings without actually getting to proper characters.

- Jonathan M Davis
Re: Variables should have the ability to be @nogc
On Tuesday, 31 May 2016 at 23:46:59 UTC, Marco Leise wrote:
> On Tue, 31 May 2016 20:41:09 +, Basile B. wrote:
>> The only thing I'm not sure about is the tri-state and the recursion. I cannot find a case where it would be justified.
>
> The recursion is simply there to find pointers in nested structs and their GcScan annotations:

- the "auto" is as if there's no annotation.
- the "yes" seems useless, because there is no case where the scanner should fail to detect members that are managed by the GC. It's for this case that things are a bit vague.

Otherwise only the "no" remains. So far I'll go for this: https://dpaste.dzfl.pl/e3023ba6a7e2 with another annotation type name, for example 'AddGcRange' or 'GcScan'.
Re: Dealing with Autodecode
On Wednesday, 1 June 2016 at 00:46:04 UTC, Walter Bright wrote:
> It is not practical to just delete or deprecate autodecode

Yes, it is. We need to stop holding on to the mistakes of the past. 9 out of 10 dentists agree that autodecoding is a mistake. Not just WAS a mistake, IS a mistake. It has ongoing cost. If we don't fix our attitude about these problems, we are going to turn into that very demon we despise, yea, even the next C++! And that's not a good thing.

> To that end, and this will be an incremental process:

I have a better one, that we discussed on IRC last night:

1) Put the string overloads for front and popFront on a version switch:

version(string_migration)
    deprecated void popFront(T)(ref T t) if(isSomeString!T)
    {
        static assert(0, "this is crap, fix your code.");
    }
else
    deprecated("use -version=string_migration to fix your buggy code, would you like to know more?")
    /* existing popFront here */

At the same time, make sure the various byWhatever functions and structs are easily available. Our preliminary investigation found about 130 places in Phobos that need to be changed. That's not hard to fix!

The static assert(0) version tells you the top-level call that triggered it. You go there, you add .byDchar or whatever, and recompile; it just works, migration achieved. Or better yet, you think about your code and fix it properly; boom, code quality improved.

D USERS **WANT** BREAKING CHANGES THAT INCREASE OVERALL CODE QUALITY WITH A SIMPLE MIGRATION PATH

2) After a while, we swap the version conditions, so opting in preserves the old behavior for a while.

3) A wee bit longer, we exterminate all this autodecoding crap and enjoy Phobos being a smaller, more efficient library.
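From the user's side, the proposed migration is usually one explicit range adapter at the top-level call. A minimal sketch (using std.utf.byCodeUnit, which is already in Phobos) of what "add .byCodeUnit and recompile" looks like:

```d
import std.algorithm.searching : count;
import std.utf : byCodeUnit;

void main()
{
    string s = "hëllo";
    // Before: `s.count!(c => c == 'l')` iterates autodecoded dchars.
    // After migration: the intent is explicit and nothing is decoded -
    // byCodeUnit yields the raw chars of the UTF-8 encoding.
    assert(s.byCodeUnit.count!(c => c == 'l') == 2);
    // Code-unit length differs from decoded length for non-ASCII text:
    assert(s.byCodeUnit.length == 6); // 'ë' is two UTF-8 code units
}
```

The result is the same for ASCII needles, but the loop runs without any decoding overhead.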
Re: Free the DMD backend
On Tuesday, 31 May 2016 at 20:18:34 UTC, default0 wrote:
> I have no idea how licensing would work in that regard, but considering that DMD's backend is actively maintained and may eventually even be ported to D, wouldn't it at some point differ enough from Symantec's "original" backend to simply call the DMD backend its own thing?

The way I understand it is that no matter how different a derivative work (such as any modification to DMD) gets, it's still a derivative work, and is subject to the terms of the license of the original work.
Re: Dealing with Autodecode
On 5/31/2016 5:56 PM, Stefan Koch wrote:
> It is only going to get harder to remove it.

Removing it from Phobos and adjusting the documentation as I suggested is the way forward regardless. If we can't get that done, how can we tell our users they have to do the same to their code?
Re: Transient ranges
On 5/31/16 4:59 PM, Dicebot wrote:
> On Tuesday, 31 May 2016 at 18:11:34 UTC, Steven Schveighoffer wrote:
>>> 1) The current definition of input range (most importantly, the fact that `front` has to be @property-like) implies `front` always returns the same result until `popFront` is called.
>>
>> Regardless of property-like or not, this should be the case. Otherwise, popFront makes no sense.
>
> Except it isn't, in many cases you call "bugs" :(

If you want to use such "ranges", the compiler will not stop you. Just don't expect any help from Phobos.

>>> 2) For ranges that call predicates on elements to evaluate the next element, this can only be achieved by caching - predicates are never required to be pure.
>>
>> Or, by not returning different things from your predicate.
>
> It is perfectly legal for a predicate to be non-pure, and it would be hugely annoying if anyone decided to prohibit it. Also, even pure predicates may simply be very expensive to evaluate, which can make `front` a silent pessimization.

There's no requirement or need to prevent it. Just a) don't do it, or b) deal with the consequences. This is like saying RedBlackTree is broken when I give it a predicate of "a == b".

> RBT at least makes certain demands about what a valid predicate can be. This is not the case for ranges in general.

RedBlackTree with "a == b" will compile and operate. It just won't do many red-black-tree-like things.

>>> 3) But caching is sub-optimal performance-wise, and thus a bunch of Phobos algorithms violate the `front` consistency/cheapness expectation by evaluating predicates each time it is called (like map).

I don't think anything defensively caches front in case the next call to front is different, unless that's specifically the reason for the range.

> And that makes input ranges violate implication #1 (front stability) casually, to the point where it can't be relied on at all, and one has to always make sure it is only evaluated once (make a stack-local copy or something like that).

That's a little much. If you expect such things, you can construct them. There's no way for the functions to ascertain what your lambda is going to do (halting problem) and determine whether to cache based on that. I think we should be aware that the range API doesn't prevent bugs of all kinds. There's only so much analysis the compiler can do.

> This is totally valid code I want to actually work and not be discarded as a "bug".

Then it's not a bug? It's going to work just fine how you specified it. I just don't consider it a valid "range" for general purposes. You can do this if you want caching:

only(0).map!(x => uniform(0, 10)).cache

-Steve
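The front-stability question above can be sketched directly. This is a minimal, hedged illustration (std.algorithm's map and cache, with a counting lambda in place of an expensive or impure predicate) of how many times the predicate actually runs:

```d
import std.algorithm.iteration : cache, map;
import std.range : only;

void main()
{
    int calls;
    // Without cache: map's lambda runs on every access to front
    auto r = only(0).map!((x) { ++calls; return x + 1; });
    auto a = r.front;
    auto b = r.front;
    assert(calls == 2); // front was re-evaluated

    calls = 0;
    // With cache: the lambda runs once per element, so front is
    // stable and cheap on repeated access
    auto c = only(0).map!((x) { ++calls; return x + 1; }).cache;
    auto d = c.front;
    auto e = c.front;
    assert(calls == 1);
}
```

So the range API leaves the choice to the caller: pay for caching explicitly, or accept that `front` may re-run the predicate.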
Re: Free the DMD backend
On Tuesday, 31 May 2016 at 20:12:33 UTC, Russel Winder wrote:
> On Tue, 2016-05-31 at 10:09 +, Atila Neves via Digitalmars-d wrote:
>> […] No, no, no, no. We had LDC be the default on Arch Linux for a while and it was a royal pain. I want to choose to use LDC when and if I need performance. Otherwise, I want my projects to compile as fast as possible and to be able to use all the shiny new features.
>
> So write a new backend for DMD the licence of which allows DMD to be in Debian and Fedora.

LDC shouldn't have to be the default compiler to be included in Debian or Fedora. The reference compiler and the default D compiler in a particular distribution are two independent things.
Re: Dealing with Autodecode
On 5/31/16 8:46 PM, Walter Bright wrote:
> It is not practical to just delete or deprecate autodecode - it is too embedded into things. What we can do, however, is stop using it ourselves and stop relying on it in the documentation, much like [] is eschewed in favor of std::vector in C++.
>
> The way to deal with it is to replace reliance on autodecode with .byDchar (.byDchar has a bonus of not throwing an exception on invalid UTF, but using the replacement dchar instead.)
>
> To that end, and this will be an incremental process:
>
> 1. Temporarily break autodecode such that using it will cause a compile error. Then, see what breaks in Phobos and fix those to use .byDchar
>
> 2. Change examples in the documentation and the Phobos examples to use .byDchar
>
> 3. Best practices should use .byDchar, .byWchar, .byChar, .byCodeUnit when dealing with ranges/arrays of characters to make it clear what is happening.

I gotta be honest: if the end of this tunnel doesn't have a char[] array which acts like an array in all circumstances, I see little point in changing anything.

-Steve
Re: The Case Against Autodecode
On 5/31/16 4:38 PM, Timon Gehr wrote:
> On 31.05.2016 21:51, Steven Schveighoffer wrote:
>> On 5/31/16 3:32 PM, H. S. Teoh via Digitalmars-d wrote:
>>> On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
>>> [...]
>>>> Does walkLength yield the same number for all representations?
>>>
>>> Let's put the question this way. Given the following string, what do *you* think walkLength should return?
>>
>> Compiler error.
>
> What about e.g. joiner?

Compiler error. Better than what it does now.

-Steve
Re: Dealing with Autodecode
On Wednesday, 1 June 2016 at 00:46:04 UTC, Walter Bright wrote:
> It is not practical to just delete or deprecate autodecode - it is too embedded into things.

Which things?

> The way to deal with it is to replace reliance on autodecode with .byDchar (.byDchar has a bonus of not throwing an exception on invalid UTF, but using the replacement dchar instead.) To that end, and this will be an incremental process:

So does this mean we intend to carry the auto-decoding wart with us into the future, telling everyone: "The obvious way is broken; we just have it for backwards compatibility"?

To come back to C++'s [] vs. std::vector: they actually have valid reasons, mainly C compatibility (keeping [] as a pointer), I believe. As of now, D is still flexible enough to make a radical change. We cannot keep putting this off! It is only going to get harder to remove it.
Dealing with Autodecode
It is not practical to just delete or deprecate autodecode - it is too embedded into things. What we can do, however, is stop using it ourselves and stop relying on it in the documentation, much like [] is eschewed in favor of std::vector in C++. The way to deal with it is to replace reliance on autodecode with .byDchar (.byDchar has a bonus of not throwing an exception on invalid UTF, but using the replacement dchar instead.) To that end, and this will be an incremental process: 1. Temporarily break autodecode such that using it will cause a compile error. Then, see what breaks in Phobos and fix those to use .byDchar 2. Change examples in the documentation and the Phobos examples to use .byDchar 3. Best practices should use .byDchar, .byWchar, .byChar, .byCodeUnit when dealing with ranges/arrays of characters to make it clear what is happening.
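Best-practices point 3 above is about making the unit of iteration an explicit choice. A rough sketch of the same distinction in Python, purely as an analogy (not D): iterating a str walks code points, roughly what .byDchar yields, while iterating its UTF-8 encoding walks code units, roughly what .byChar yields.

```python
# Analogy only (Python, not D): make the unit of iteration an explicit choice.
s = "héllo"                            # "é" is one code point, U+00E9

code_points = list(s)                  # per code point, like .byDchar
code_units = list(s.encode("utf-8"))   # per UTF-8 code unit, like .byChar

assert len(code_points) == 5           # h, é, l, l, o
assert len(code_units) == 6            # "é" takes two UTF-8 bytes
```

The point of .byDchar/.byChar/.byCodeUnit is exactly this: the caller states which of the two views it wants, instead of getting one of them silently.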
Re: The Case Against Autodecode
On 05/31/2016 01:23 PM, Andrei Alexandrescu wrote: On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote: The standard library has to fight against itself because of autodecoding! The vast majority of the algorithms in Phobos are special-cased on strings in an attempt to get around autodecoding. That alone should highlight the fact that autodecoding is problematic. The way I see it is it's specialization to speed things up without giving up the higher level abstraction. -- Andrei Problem is, that "higher"[1] level abstraction you don't want to give up (ie working on code points) is rarely useful, and yet the default is to pay the price for something which is rarely useful. [1] It's really the mid-level abstraction - grapheme is the high-level one (and more likely useful).
Re: The Case Against Autodecode
On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote: On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote: On 05/31/2016 03:32 PM, H. S. Teoh via Digitalmars-d wrote: Let's put the question this way. Given the following string, what do *you* think walkLength should return? şŭt̥ḛ́k̠ The number of code units in the string. That's the contract promised and honored by Phobos. -- Andrei Code points I mean. -- Andrei Yes, we know it's the contract. ***That's the problem.*** As everybody is saying, it *SHOULDN'T* be the contract. Why shouldn't it be the contract? Because it's proven itself, both logically (as presented by pretty much everybody other than you in both this and other threads) and empirically (in phobos, warp, and other user code) to be both the least useful and most PITA option.
Re: The Case Against Autodecode
On 5/31/2016 1:57 AM, Chris wrote: 1. Given your experience with Warp, how hard would it be to clean Phobos up? It's not hard, it's just a bit tedious. 2. After recoding a number of Phobos functions, how much code actually broke (yours or someone else's)? It's been a while so I don't remember exactly, but as I recall if the API had to change, I created a new overload or a new name, and left the old one as it is. For the std.path functions, I just changed them. While that technically changed the API, I'm not aware of any actual problems it caused. (Decoding file strings is a latent bug anyway, as pointed out elsewhere in this thread. It's a change that had to be made sooner or later.)
Re: Reddit announcements
On Tuesday, 31 May 2016 at 20:47:39 UTC, cym13 wrote: On Tuesday, 31 May 2016 at 19:33:46 UTC, John Colvin wrote: On Tuesday, 31 May 2016 at 18:57:29 UTC, o-genki-desu-ka wrote: Many nice announcements here last week. I put some on reddit. https://www.reddit.com/r/programming/comments/4lwufi/d_embedded_database_v01_released/ https://www.reddit.com/r/programming/comments/4lwubv/c_to_d_converter_based_on_clang/ https://www.reddit.com/r/programming/comments/4lwu5p/coedit_2_ide_update_6_released/ https://www.reddit.com/r/programming/comments/4lwtxw/compiletime_sqlite_for_d_beta_release/ https://www.reddit.com/r/programming/comments/4lwtr0/button_a_fast_correct_and_elegantly_simple_build/ https://www.reddit.com/r/programming/comments/4lwtn9/first_release_of_powernex_an_os_kernel_written_in/ I'm a bit concerned that people will react negatively to them all being dumped at once. Same here; moreover, while some announcements are about "ready to show" projects (button or powernex for example), others like "D embedded database" clearly are too young not to annoy /programming/ people, IMHO. Currently there's a bot that posts everything to reddit, but it also somehow kills every discussion there. https://www.reddit.com/r/d_language/ Btw, if you have better ideas on how to solve this problem, you might get involved in this discussion: https://github.com/CyberShadow/DFeed/issues/63
Re: Split general into multiple threads
On Sunday, 29 May 2016 at 11:44:25 UTC, ZombineDev wrote: On Sunday, 29 May 2016 at 11:35:12 UTC, Seb wrote: [...] I like this list better than the current one, but with one change: taking LDC out of core and renaming it to LDC and LLVM, so other D projects that leverage LLVM can be hosted there (e.g. SDC, Calypso, CPP2D, etc) and to be on par with GDC. Having an additional LLVM category sounds reasonable. So do we go with this new structure? Any major objections? It would be nice to be able to move conversations. Instead of "please use D.learn instead", you would see "moved to the more appropriate D.learn". See also: https://github.com/CyberShadow/DFeed/issues/67
Re: year to date pull statistics (week ending 2016-05-28)
On Tuesday, 31 May 2016 at 23:48:00 UTC, Brad Roberts wrote: total open: 252 created since 2016-01-01 and still open: 106 ... total open: 284 created since 2016-01-01 and still open: 142 Ouch - that's a huge spike! What happened to the idea from dconf to automatically assign PR managers based on a hard-coded list of maintainers for modules, and randomly otherwise? Other ideas?
Re: Variables should have the ability to be @nogc
On Tue, 31 May 2016 20:41:09 +0000, Basile B. wrote:
> The only thing I'm not sure about is the tri-state and
> the recursion. I cannot find a case where it would be justified.
The recursion is simply there to find pointers in nested structs and their GcScan annotations:

struct A { B b; }             // A does not need scanning
struct B { @noScan void* p; }

The tri-state may not be necessary; I don't remember my rationale there. I do use GcScan.automatic as the default in memory allocation, for example, with the option to force it to yes or no. It gives you more control, just in case. -- Marco
Re: year to date pull statistics (week ending 2016-05-28)
total open: 284 created since 2016-01-01 and still open: 142

                          created  closed  delta
2016-05-29 - today             25      25      0
2016-05-22 - 2016-05-28        46      34    -12
2016-05-15 - 2016-05-21        40      36     -4
2016-05-08 - 2016-05-14        82      55    -27
2016-05-01 - 2016-05-07        37      59    +22
2016-04-24 - 2016-04-30        74      85    +11
2016-04-17 - 2016-04-23        51      58     +7
2016-04-10 - 2016-04-16        52      58     +6
2016-04-03 - 2016-04-09        64      44    -20
2016-03-27 - 2016-04-02        65      60     -5
2016-03-20 - 2016-03-26        65      62     -3
2016-03-13 - 2016-03-19        44      51     +7
2016-03-06 - 2016-03-12        41      46     +5
2016-02-28 - 2016-03-05        54      47     -7
2016-02-21 - 2016-02-27        29      20     -9
2016-02-14 - 2016-02-20        32      36     +4
2016-02-07 - 2016-02-13        52      52      0
2016-01-31 - 2016-02-06        54      61     +7
2016-01-24 - 2016-01-30        40      37     -3
2016-01-17 - 2016-01-23        31      21    -10
2016-01-10 - 2016-01-16        39      42     +3
2016-01-03 - 2016-01-09        26      33     +7
2016-01-01 - 2016-01-02         2       5     +3
                             ----    ----   ----
                             1045    1027    -18

https://auto-tester.puremagic.com/chart.ghtml?projectid=1
[OT] UTF-16
On Tue, 31 May 2016 15:47:02 -0700, Walter Bright wrote: > But I didn't know which encoding would win - UTF-8, UTF-16, or UCS-2, so D bet on all three. If I had a do-over, I'd just support UTF-8. UTF-16 is useful pretty much only as a transitional encoding to talk with Windows APIs. I think so too, although more APIs than just Windows use UTF-16. Think of Java or ICU. Aside from their Java heritage, they found that it is the fastest encoding for transcoding from and to Unicode, as UTF-16 code units cover most 8-bit codepages. Also, Qt defined its character type as a UTF-16 code unit, but they probably regret it, as the 'charmap' program KCharSelect is now unable to show Unicode characters >= 0x10000. -- Marco
Re: The Case Against Autodecode
On 06/01/2016 12:47 AM, Walter Bright wrote: But I didn't know which encoding would win - UTF-8, UTF-16, or UCS-2, so D bet on all three. If I had a do-over, I'd just support UTF-8. UTF-16 is useful pretty much only as a transitional encoding to talk with Windows APIs. Nobody uses UCS-2 (it consumes far too much memory). Wikipedia says [1] that UCS-2 is essentially UTF-16 without surrogate pairs. I suppose you mean UTF-32/UCS-4. [1] https://en.wikipedia.org/wiki/UTF-16
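The UCS-2 vs. UTF-16 distinction is easy to make concrete: UCS-2 is a fixed-width 16-bit encoding with no representation at all for code points above U+FFFF, while UTF-16 spends two code units (a surrogate pair) on them. A small Python illustration:

```python
# UTF-16 represents code points above U+FFFF as surrogate pairs;
# UCS-2 has no representation for them at all.
bmp = "A"               # U+0041, inside the Basic Multilingual Plane
astral = "\U0001F600"   # U+1F600, outside the BMP

assert len(bmp.encode("utf-16-le")) == 2     # one 16-bit code unit
assert len(astral.encode("utf-16-le")) == 4  # two code units (surrogate pair)
```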
Re: faster splitter
On Tuesday, 31 May 2016 at 21:29:34 UTC, Andrei Alexandrescu wrote: You may want to then try https://dpaste.dzfl.pl/392710b765a9, which generates inline code on all compilers. -- Andrei In general, it might be beneficial to use ldc.intrinsics.llvm_expect (cf. __builtin_expect) for things like that in order to optimise basic block placement. (We should probably have a compiler-independent API for that in core.*, by the way.) In this case, the skip computation path is probably small enough for that not to matter much, though. Another thing that might be interesting to do (now that you have a "clever" baseline) is to start counting cycles and make some comparisons against manual asm/intrinsics implementations. For short(-ish) needles, PCMPESTRI is probably the most promising candidate, although I suspect that for \r\n scanning in long strings in particular, an optimised AVX2 solution might have higher throughput. Of course these observations are not very valuable without backing them up with measurements, but it seems like before optimising a generic search algorithm for short-needle test cases, having one's eyes on a solid SIMD baseline would be a prudent thing to do. — David
Re: The Case Against Autodecode
On 5/31/2016 1:20 PM, Marco Leise wrote: [...] I agree. I dealt with the madness of code pages, Shift-JIS, EBCDIC, locales, etc., in the pre-Unicode days. Despite its problems, Unicode (and UTF-8) is a major improvement, and I mean major. 16 years ago, I bet that Unicode was the future, and events have shown that to be correct. But I didn't know which encoding would win - UTF-8, UTF-16, or UCS-2, so D bet on all three. If I had a do-over, I'd just support UTF-8. UTF-16 is useful pretty much only as a transitional encoding to talk with Windows APIs. Nobody uses UCS-2 (it consumes far too much memory).
Re: D Embedded Database v0.1 Released
On Saturday, 28 May 2016 at 14:08:18 UTC, Piotrek wrote: Short description A database engine for quick and easy integration into any D program. Full compatibility with D types and ranges. Design Goals (none is accomplished yet) - ACID - No external dependencies - Single file storage - Multithread support - Suitable for microcontrollers More info for the interested at: Docs: https://gitlab.com/PiotrekDlang/DraftLib/blob/master/docs/database/index.md Code: https://gitlab.com/PiotrekDlang/DraftLib/tree/master/src The project is at an early stage of development. Piotrek Nice effort. Would you be interested in collaborating with the SQLite-D project? It has similar goals, albeit with a file format compatible with SQLite.
Re: Our Sister
On Wed, 1 Jun 2016 01:06:36 +1000, Manu via Digitalmars-d wrote: > D loves templates, but templates aren't a given. Closed-source > projects often can't have templates in the public API (ie, source > should not be available), and this is my world. Same effect for GPL code. Funny. (Template instantiations are like statically linking in the open source code.) -- Marco
Re: The Case Against Autodecode
On Tue, 31 May 2016 16:56:43 -0400, Andrei Alexandrescu wrote: > On 05/31/2016 03:44 PM, Jonathan M Davis via Digitalmars-d wrote: > > In the vast majority of cases what folks care about is full character > How are you so sure? -- Andrei Because a full character is the typical unit of a written language. It's what we visualize in our heads when we think about finding a substring or counting characters. A special case of this is the reduction to ASCII, where we can use code units in place of grapheme clusters. -- Marco
Re: Getting the parameters and other attributes belonging to the function overload with the greatest number of arguments
On Tuesday, 31 May 2016 at 20:46:37 UTC, Basile B. wrote: Yes this can be done, you must use the getOverload trait: https://dlang.org/spec/traits.html#getOverloads The result of this trait is the function itself so it's not hard to use, e.g the result can be passed directly to 'Parameters', 'ReturnType' and such library traits. Awesome, thank you!
Re: Transient ranges
On Tuesday, 31 May 2016 at 21:25:12 UTC, Timon Gehr wrote: On 31.05.2016 22:59, Dicebot wrote: I think we should be aware that the range API doesn't prevent bugs of all kinds. There's only so much analysis the compiler can do. This is totally valid code that I want to actually work, and not have discarded as a "bug". map often allows random access. Do you suggest it should cache opIndex too? A random-access map would have to store all already-evaluated items in memory, in my opinion.
Re: faster splitter
On 05/31/2016 04:18 PM, Chris wrote: I actually thought that dmd didn't place `computeSkip` inside of the loop. This begs the question if it should be moved to the loop, in case we use it in Phobos, to make sure that it is as fast as possible even with dmd. However, I like it the way it is now. You may want to then try https://dpaste.dzfl.pl/392710b765a9, which generates inline code on all compilers. -- Andrei
Re: Transient ranges
On 31.05.2016 22:59, Dicebot wrote: I think we should be aware that the range API doesn't prevent bugs of all kinds. There's only so much analysis the compiler can do. This is totally valid code that I want to actually work, and not have discarded as a "bug". map often allows random access. Do you suggest it should cache opIndex too?
Re: The Case Against Autodecode
On Tuesday, 31 May 2016 at 21:01:17 UTC, Andrei Alexandrescu wrote: If user code needs to go upper at the grapheme level, they can If anything this thread strengthens my opinion that autodecoding is a sweet spot. -- Andrei Unicode FAQ disagrees (http://unicode.org/faq/utf_bom.html): "Q: How about using UTF-32 interfaces in my APIs? A: Except in some environments that store text as UTF-32 in memory, most Unicode APIs are using UTF-16. With UTF-16 APIs the low level indexing is at the storage or code unit level, with higher-level mechanisms for graphemes or words specifying their boundaries in terms of the code units. This provides efficiency at the low levels, and the required functionality at the high levels."
Re: The Case Against Autodecode
On Tue, May 31, 2016 at 05:01:17PM -0400, Andrei Alexandrescu via Digitalmars-d wrote: > On 05/31/2016 04:01 PM, Jonathan M Davis via Digitalmars-d wrote: > > Wasn't the whole point of operating at the code point level by > > default to make it so that code would be operating on full > > characters by default instead of chopping them up as is so easy to > > do when operating at the code unit level? > > The point is to operate on representation-independent entities > (Unicode code points) instead of low-level representation-specific > artifacts (code units). This is basically saying that we operate on dchar[] by default, except that we disguise its detrimental memory usage consequences by compressing to UTF-8/UTF-16 and incurring the cost of decompression every time we access its elements. Perhaps you love the idea of running an OS that stores all files in compressed form and always decompresses upon every syscall to read(), but I prefer a higher-performance system. > That's the contract, and it seems meaningful > seeing how Unicode is defined in terms of code points as its abstract > building block. Where's this contract stated, and when did we sign up for this? > If user code needs to go lower at the code unit level, they can do so. > If user code needs to go upper at the grapheme level, they can do so. Only with much pain by using workarounds to bypass meticulously-crafted autodecoding algorithms in Phobos. > If anything this thread strengthens my opinion that autodecoding is a > sweet spot. -- Andrei No, autodecoding is a stalemate that's neither fast nor correct. T -- "Real programmers can write assembly code in any language. :-)" -- Larry Wall
Re: The Case Against Autodecode
On Tue, 31 May 2016 13:06:16 -0400, Andrei Alexandrescu wrote: > On 05/31/2016 12:54 PM, Jonathan M Davis via Digitalmars-d wrote: > > Equality does not require decoding. Similarly, functions like find don't > > either. Something like filter generally would, but it's also not > > particularly normal to filter a string on a by-character basis. You'd > > probably want to get to at least the word level in that case. > > It's nice that the stdlib takes care of that. Both "equality" and "find" require byGrapheme. ⇰ The equivalence algorithm first brings both strings to a common normalization form (NFD or NFC), which works on one grapheme cluster at a time, and afterwards does the binary comparison. http://www.unicode.org/reports/tr15/#Canon_Compat_Equivalence ⇰ Find would yield false positives for the start of grapheme clusters. I.e. it will match 'o' in an NFD "ö" (simplified example). http://www.unicode.org/reports/tr10/#Searching -- Marco
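The first point above, that equality needs normalization before binary comparison, can be demonstrated in a few lines. Python's unicodedata stands in here for whatever normalization API one would use (D's std.uni offers comparable functionality):

```python
import unicodedata

composed = "\u00F6"     # "ö" as a single precomposed code point
decomposed = "o\u0308"  # "o" followed by a combining diaeresis

# Code-point comparison says the strings differ, although they
# render as the same character...
assert composed != decomposed
# ...but after bringing both to a common normalization form they agree.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```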
Re: The Case Against Autodecode
On 31.05.2016 22:20, Marco Leise wrote: Am Tue, 31 May 2016 16:29:33 + schrieb Joakim: >Part of it is the complexity of written language, part of it is >bad technical decisions. Building the default string type in D >around the horrible UTF-8 encoding was a fundamental mistake, >both in terms of efficiency and complexity. I noted this in one >of my first threads in this forum, and as Andrei said at the >time, nobody agreed with me, with a lot of hand-waving about how >efficiency wasn't an issue or that UTF-8 arrays were fine. >Fast-forward years later and exactly the issues I raised are now >causing pain. Maybe you can dig up your old post and we can look at each of your complaints in detail. It is probably this one. Not sure what "exactly the issues" are though. http://forum.dlang.org/thread/bwbuowkblpdxcpyse...@forum.dlang.org
Re: The Case Against Autodecode
On 05/31/2016 04:01 PM, Jonathan M Davis via Digitalmars-d wrote: Wasn't the whole point of operating at the code point level by default to make it so that code would be operating on full characters by default instead of chopping them up as is so easy to do when operating at the code unit level? The point is to operate on representation-independent entities (Unicode code points) instead of low-level representation-specific artifacts (code units). That's the contract, and it seems meaningful seeing how Unicode is defined in terms of code points as its abstract building block. If user code needs to go lower at the code unit level, they can do so. If user code needs to go upper at the grapheme level, they can do so. If anything this thread strengthens my opinion that autodecoding is a sweet spot. -- Andrei
Re: Transient ranges
On Tuesday, 31 May 2016 at 18:11:34 UTC, Steven Schveighoffer wrote: 1) Current definition of input range (most importantly, the fact `front` has to be @property-like) implies `front` always returns the same result until `popFront` is called. Regardless of property-like or not, this should be the case. Otherwise, popFront makes no sense. Except it isn't, in many of the cases you call "bugs" :( 2) For ranges that call predicates on elements to evaluate the next element, this can only be achieved by caching - predicates are never required to be pure. Or, by not returning different things from your predicate. It is perfectly legal for a predicate to be non-pure, and it would be hugely annoying if anyone decided to prohibit that. Also, even pure predicates may simply be very expensive to evaluate, which can make `front` a silent pessimization. This is like saying RedBlackTree is broken when I give it a predicate of "a == b". RBT at least makes certain demands about what a valid predicate can be. This is not the case for ranges in general. 3) But caching is sub-optimal performance-wise, and thus a bunch of Phobos algorithms violate the `front` consistency / cheapness expectation by evaluating predicates each time it is called (like map). I don't think anything defensively caches front in case the next call to front is different, unless that's specifically the reason for the range. And that makes input ranges violate implication #1 (front stability) casually, to the point that it can't be relied upon at all, and one has to always make sure it is only evaluated once (make a stack-local copy or something like that). I think we should be aware that the range API doesn't prevent bugs of all kinds. There's only so much analysis the compiler can do. This is totally valid code that I want to actually work, and not have discarded as a "bug".
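The instability described in point 3 is not specific to D; any lazy map whose front re-runs the predicate on each access shows it. A minimal Python model of the behavior (the class and names are made up for illustration):

```python
# A lazy map whose "front" re-evaluates the predicate on every access,
# mirroring a map range that does not cache front.
class LazyMap:
    def __init__(self, fn, items):
        self.fn, self.items, self.i = fn, items, 0

    @property
    def front(self):              # no caching: fn runs on each access
        return self.fn(self.items[self.i])

    def pop_front(self):
        self.i += 1

calls = []
def impure(x):                    # predicate with an observable side effect
    calls.append(x)
    return x * 2

r = LazyMap(impure, [1, 2, 3])
assert r.front == 2
assert r.front == 2               # same value this time, but...
assert len(calls) == 2            # ...the predicate ran twice for one element
```

With an impure or expensive predicate, the two accesses to front above can return different values or silently double the cost, which is exactly the front-stability complaint.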
Re: The Case Against Autodecode
On 05/31/2016 03:44 PM, Jonathan M Davis via Digitalmars-d wrote: In the vast majority of cases what folks care about is full character How are you so sure? -- Andrei
Re: The Case Against Autodecode
On 05/31/2016 03:34 PM, ag0aep6g wrote: On 05/31/2016 07:21 PM, Andrei Alexandrescu wrote: Could you please substantiate that? My understanding is that code unit is a higher-level Unicode notion independent of encoding, whereas code point is an encoding-dependent representation detail. -- Andrei You got the terms mixed up. Code unit is lower level. Code point is higher level. Apologies and thank you. -- Andrei
Re: The Case Against Autodecode
On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote: On 05/31/2016 03:32 PM, H. S. Teoh via Digitalmars-d wrote: Let's put the question this way. Given the following string, what do *you* think walkLength should return? şŭt̥ḛ́k̠ The number of code units in the string. That's the contract promised and honored by Phobos. -- Andrei Code points I mean. -- Andrei
Re: The Case Against Autodecode
On 05/31/2016 03:32 PM, H. S. Teoh via Digitalmars-d wrote: Let's put the question this way. Given the following string, what do *you* think walkLength should return? şŭt̥ḛ́k̠ The number of code units in the string. That's the contract promised and honored by Phobos. -- Andrei
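The three candidate answers in this exchange (code units, code points, graphemes) give three different numbers for any string containing combining marks. A Python sketch, using a deliberately crude grapheme count (code points that are not combining marks; real grapheme segmentation per UAX #29 is more involved):

```python
import unicodedata

s = "u\u0301"   # "ú" spelled as 'u' plus a combining acute accent

code_units = len(s.encode("utf-8"))
code_points = len(s)
# Crude grapheme count: base code points only (ignores ZWJ sequences etc.)
graphemes = sum(1 for c in s if unicodedata.combining(c) == 0)

assert code_units == 3    # 'u' is 1 byte, U+0301 is 2 bytes
assert code_points == 2
assert graphemes == 1     # one user-perceived character
```

For a heavily accented string like "şŭt̥ḛ́k̠" the gap between the code-point count and the grapheme count only widens, which is why "what should walkLength return?" has no single obvious answer.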
Re: Is there any overhead iterating over a pointer using a slice?
On Tuesday, 31 May 2016 at 18:55:18 UTC, Gary Willoughby wrote: If I have a pointer and iterate over it using a slice, like this:

T* foo = ...;
foreach (element; foo[0 .. length]) { ... }

Is there any overhead compared with pointer arithmetic in a for loop? Use the assembly output of your compiler to check! :-) It's fun to look at. For example, with GDC: http://goo.gl/Ur9Srv No difference. cheers, Johan
[Issue 15371] __traits(getMember) should bypass the protection
https://issues.dlang.org/show_bug.cgi?id=15371 --- Comment #4 from b2.t...@gmx.com --- In the meantime, when the trait code is for a struct or a class, it's possible to use its '.tupleof' property. It's not affected by the visibility. Instead of allMembers:

import std.meta: aliasSeqOf;
import std.range: iota;

foreach(i; aliasSeqOf!(iota(0, T.tupleof.length)))
{
    alias MT = typeof(T.tupleof[i]);
    ...
}

This is not exactly the same, but when the trait code is to inspect the variable types or UDAs it works fine. --
Re: The Case Against Autodecode
On Tuesday, 31 May 2016 at 20:28:32 UTC, ag0aep6g wrote: On 05/31/2016 06:29 PM, Joakim wrote: D devs should lead the way in getting rid of the UTF-8 encoding, not bickering about how to make it more palatable. I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte. Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_ Guys, may I ask you to move this discussion to a new thread? I'd like to follow the (already crowded) autodecode thing, and this is really a separate topic. No, this is the root of the problem, but I'm not interested in debating it, so you can go back to discussing how to avoid the elephant in the room.
Re: The Case Against Autodecode
On Tue, May 31, 2016 at 10:38:03PM +0200, Timon Gehr via Digitalmars-d wrote: > On 31.05.2016 21:51, Steven Schveighoffer wrote: > > On 5/31/16 3:32 PM, H. S. Teoh via Digitalmars-d wrote: > > > On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via > > > Digitalmars-d wrote: > > > [...] > > > > Does walkLength yield the same number for all representations? > > > > > > Let's put the question this way. Given the following string, what > > > do *you* think walkLength should return? > > > > Compiler error. > > > > -Steve > > What about e.g. joiner? joiner is one of those algorithms that can work perfectly fine *without* autodecoding anything at all. The only time it'd actually need to decode would be if you're joining a set of UTF-8 strings with a UTF-16 delimiter, or some other such combination, which should be pretty rare. After all, within the same application you'd usually only be dealing with a single encoding rather than mixing UTF-8, UTF-16, and UTF-32 willy-nilly. (Unless the code is specifically written for transcoding, in which case decoding is part of the job description, so it should be expected that the programmer ought to know how to do it properly without needing Phobos to do it for him.) Even in the case of s.joiner('Ш'), joiner could easily convert that dchar into a short UTF-8 string and then operate directly on UTF-8. T -- Just because you survived after you did it, doesn't mean it wasn't stupid!
Re: Getting the parameters and other attributes belonging to the function overload with the greatest number of arguments
On Tuesday, 31 May 2016 at 20:06:47 UTC, pineapple wrote: I'd like to find the overload of some function with the most parameters and (in this specific case) to get their identifiers using e.g. ParameterIdentifierTuple. There have also been cases where I'd have liked to iterate over the result of Parameters!func for each overload of that function. Can this be done, and if so how? Yes this can be done, you must use the getOverload trait: https://dlang.org/spec/traits.html#getOverloads The result of this trait is the function itself so it's not hard to use, e.g the result can be passed directly to 'Parameters', 'ReturnType' and such library traits.
Re: The Case Against Autodecode
On Tuesday, 31 May 2016 at 20:20:46 UTC, Marco Leise wrote: Am Tue, 31 May 2016 16:29:33 + schrieb Joakim: Part of it is the complexity of written language, part of it is bad technical decisions. Building the default string type in D around the horrible UTF-8 encoding was a fundamental mistake, both in terms of efficiency and complexity. I noted this in one of my first threads in this forum, and as Andrei said at the time, nobody agreed with me, with a lot of hand-waving about how efficiency wasn't an issue or that UTF-8 arrays were fine. Fast-forward years later and exactly the issues I raised are now causing pain. Maybe you can dig up your old post and we can look at each of your complaints in detail. Not interested. I believe you were part of that thread then. Google it if you want to read it again. UTF-8 is an antiquated hack that needs to be eradicated. It forces all other languages than English to be twice as long, for no good reason, have fun with that when you're downloading text on a 2G connection in the developing world. It is unnecessarily inefficient, which is precisely why auto-decoding is a problem. It is only a matter of time till UTF-8 is ditched. You don't download twice the data. First of all, some languages had two-byte encodings before UTF-8, and second web content is full of HTML syntax and gzip compressed afterwards. The vast majority can be encoded in a single byte, and are unnecessarily forced to two bytes by the inefficient UTF-8/16 encodings. HTML syntax is a non sequitur; compression helps but isn't as efficient as a proper encoding. Take this Thai Wikipedia entry for example: https://th.wikipedia.org/wiki/%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2 The download of the gzipped html is 11% larger in UTF-8 than in Thai TIS-620 single-byte encoding. And that is dwarfed by the size of JS + images. (I don't have the numbers, but I expect the effective overhead to be ~2%). 
Nobody on a 2G connection is waiting minutes to download such massive web pages. They are mostly sending text to each other on their favorite chat app, and waiting longer and using up more of their mobile data quota if they're forced to use bad encodings. Ironically a lot of symbols we take for granted would then have to be implemented as HTML entities using their Unicode code points(sic!). Amongst them basic stuff like dashes, degree (°) and minute (′), accents in names, non-breaking space or footnotes (↑). No, they just don't use HTML, opting for much superior mobile apps instead. :) D devs should lead the way in getting rid of the UTF-8 encoding, not bickering about how to make it more palatable. I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte. Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_ That would have put D on an island. "Some kind of header" would be a horrible mess to have in strings, because you have to account for it when concatenating strings and scan for them all the time to see if there is some interspersed 2 byte encoding in the stream. That's hardly better than UTF-8. And yes, a huge amount of websites mix scripts and a lot of other text uses the available extra symbols like ° or α,β,γ. Let's see: a constant-time addition to a header or constantly decoding every character every time I want to manipulate the string... I wonder which is a better choice?! You would not "intersperse" any other encodings, unless you kept track of those substrings in the header. My whole point is that such mixing of languages or "extra symbols" is an extreme minority use case: the vast majority of strings are a single language. 
The common string-handling use case, by far, is strings with only one language, with a distant second some substrings in a second language, yet here we are putting the overhead into every character to allow inserting characters from an arbitrary language! This is madness. No thx, madness was when we couldn't reliably open text files, because nowhere was the encoding stored, and when you had to compile programs for each of a dozen codepages so localized text would be rendered correctly. And your retro codepage system won't convince the world to drop Unicode either. Unicode _is_ a retro codepage system; they merely standardized a bunch of the most popular codepages. So that's not going away no matter what system you use. :) Yes, the complexity of diacritics and combining characters will remain, but that is complexity that is inherent to the variety of written language. UTF-8 is not: it is just a bad technical decision, likely chosen for ASCII compatibility and some misguided notion that being able to combine arbitrary language strings with no other metadata was worthwhile.
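Setting aside the compression argument, the raw per-character cost cited for Thai earlier in this exchange is easy to verify: a Thai letter costs three bytes in UTF-8 versus one in the national single-byte encoding TIS-620 (exposed in Python as the 'tis_620' codec). This only illustrates the per-character sizes, not the quoted 11% gzipped-page measurement:

```python
# One Thai letter: 3 bytes in UTF-8, 1 byte in TIS-620.
ko_kai = "\u0E01"   # ก, THAI CHARACTER KO KAI

assert len(ko_kai.encode("utf-8")) == 3
assert len(ko_kai.encode("tis_620")) == 1
```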
Re: Reddit announcements
On Tuesday, 31 May 2016 at 19:33:46 UTC, John Colvin wrote: On Tuesday, 31 May 2016 at 18:57:29 UTC, o-genki-desu-ka wrote: Many nice announcements here last week. I put some on reddit. https://www.reddit.com/r/programming/comments/4lwufi/d_embedded_database_v01_released/ https://www.reddit.com/r/programming/comments/4lwubv/c_to_d_converter_based_on_clang/ https://www.reddit.com/r/programming/comments/4lwu5p/coedit_2_ide_update_6_released/ https://www.reddit.com/r/programming/comments/4lwtxw/compiletime_sqlite_for_d_beta_release/ https://www.reddit.com/r/programming/comments/4lwtr0/button_a_fast_correct_and_elegantly_simple_build/ https://www.reddit.com/r/programming/comments/4lwtn9/first_release_of_powernex_an_os_kernel_written_in/ I'm a bit concerned that people will react negatively to them all being dumped at once. Same here; moreover, while some announcements are about "ready to show" projects (button or powernex for example), others like "D embedded database" clearly are too young not to annoy /programming/ people, IMHO.
Re: Variables should have the ability to be @nogc
On Tuesday, 31 May 2016 at 19:04:39 UTC, Marco Leise wrote: Am Tue, 31 May 2016 15:53:44 + schrieb Basile B.: This solution seems smarter than using the existing '@nogc' attribute. Plus one also for the fact that nothing has to be done in DMD. I just constrained myself to what can be done in user code from the start. :) Did you encounter the issue with protected and private members? When I tested the template, I immediately got some warnings. DMD interprets my 'getMember' calls as a deprecated abuse of bug 314, but in dmd 2.069 I would get true errors. Actually it is in a large half-ported code base from C++ and I haven't ever had a running executable, nor did I test it with recent dmd versions. My idea was to mostly have @nogc code, but allow it for a transition time or places where GC use does not have an impact. Here is the code, free to use for all purposes. Thanks for sharing the template. When using '.tupleof' instead of the traits 'allMembers'/'getMember' there's no issue with visibility, which is awesome. It means that the template can be proposed very quickly in Phobos. The only thing I'm not sure about is the tri-state and the recursion. I cannot find a case where it would be justified.
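As a rough illustration of why `.tupleof` sidesteps the visibility problems of `allMembers`/`getMember`: `std.traits.Fields` is built on `.tupleof`, so private fields are reachable without deprecation noise. This is only a minimal sketch with a hypothetical `fieldCount` helper, not the template from the post:

```d
import std.traits : Fields, isAggregateType;

struct Inner { private int secret; }
struct Outer { Inner inner; string name; }

// Count leaf fields, recursing into nested aggregates.
// Fields!T goes through .tupleof, so even private members
// are seen, unlike __traits(getMember) on a private symbol.
size_t fieldCount(T)()
{
    size_t n = 0;
    foreach (F; Fields!T)
    {
        static if (isAggregateType!F)
            n += fieldCount!F();
        else
            ++n;
    }
    return n;
}

void main()
{
    assert(fieldCount!Outer() == 2); // Inner.secret + Outer.name
}
```

The recursion here is the "recursion" question from the post in miniature: whether to descend into aggregate-typed members or treat them as leaves.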
Re: The Case Against Autodecode
On Tue, May 31, 2016 at 10:47:56PM +0300, Dmitry Olshansky via Digitalmars-d wrote: > On 31-May-2016 01:00, Walter Bright wrote: > > On 5/30/2016 11:25 AM, Adam D. Ruppe wrote: > > > I don't agree on changing those. Indexing and slicing a char[] is > > > really useful and actually not hard to do correctly (at least with > > > regard to handling code units). > > > > Yup. It isn't hard at all to use arrays of codeunits correctly. > > Ehm as long as all you care for is operating on substrings I'd say. > Working with individual character requires either decoding or clever > tricks like operating on encoded UTF directly. [...] Working on individual characters needs byGrapheme, unless you know beforehand that the character(s) you're working with are ASCII, or fits in a single code unit. About "clever tricks", it's not really that hard. I was thinking that things like s.canFind('Ш') should translate the 'Ш' into a UTF-8 byte sequence, and then do a substring search directly on the encoded string. This way, a large number of single-character algorithms don't even need to decode. The way UTF-8 is designed guarantees that there will not be any false positives. This will eliminate a lot of the current overhead of autodecoding. T -- Klein bottle for rent ... inquire within. -- Stephen Mulraney
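The trick described above can be sketched as follows; `findChar` is a hypothetical helper, not Phobos code. UTF-8's design (lead bytes and continuation bytes occupy disjoint value ranges) is what guarantees the byte-level search cannot produce a false positive:

```d
import std.utf : encode;

// Search for a single dchar in a UTF-8 string without decoding
// the haystack: encode the needle once, then do a plain byte
// search. Returns the byte index of the match, or -1.
ptrdiff_t findChar(const(char)[] haystack, dchar needle)
{
    char[4] buf;
    immutable len = encode(buf, needle); // code units written, 1..4
    if (haystack.length < len) return -1;
    outer: foreach (i; 0 .. haystack.length + 1 - len)
    {
        foreach (j; 0 .. len)
            if (haystack[i + j] != buf[j]) continue outer;
        return i;
    }
    return -1;
}

void main()
{
    assert(findChar("православие", 'Ш') == -1);
    // 'ш', 'о', 'в' are 2 bytes each, so 'Ш' starts at byte 7
    assert(findChar("шов Шов", 'Ш') == 7);
}
```

The inner comparison is a naive byte match for clarity; a real implementation would delegate to a fast substring search once the needle is encoded.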
Re: The Case Against Autodecode
On 31.05.2016 21:51, Steven Schveighoffer wrote: On 5/31/16 3:32 PM, H. S. Teoh via Digitalmars-d wrote: On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via Digitalmars-d wrote: [...] Does walkLength yield the same number for all representations? Let's put the question this way. Given the following string, what do *you* think walkLength should return? Compiler error. -Steve What about e.g. joiner?
Re: The Case Against Autodecode
On 05/31/2016 06:29 PM, Joakim wrote: D devs should lead the way in getting rid of the UTF-8 encoding, not bickering about how to make it more palatable. I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte. Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_ Guys, may I ask you to move this discussion to a new thread? I'd like to follow the (already crowded) autodecode thing, and this is really a separate topic.
Re: D Embedded Database v0.1 Released
On Saturday, 28 May 2016 at 14:08:18 UTC, Piotrek wrote: Short description A database engine for quick and easy integration into any D program. Full compatibility with D types and ranges. Design Goals (none is accomplished yet) - ACID - No external dependencies - Single file storage - Multithread support - Suitable for microcontrollers Example code: import draft.database; import std.stdio; void main(string[] args) { static struct Test { int a; string s; } auto db = DataBase("testme.db"); auto collection = db.collection!Test("collection_name",true); collection.put(Test(1,"Hello DB")); writeln(db.collection!Test("collection_name")); } More info for interested at: Docs: https://gitlab.com/PiotrekDlang/DraftLib/blob/master/docs/database/index.md Code: https://gitlab.com/PiotrekDlang/DraftLib/tree/master/src The project is at its early stage of development. Piotrek This might provide useful information if you're aiming for something like sqlite (hopefully not offtopic): https://github.com/cznic/ql It's an embeddable database engine in Go with goals similar to yours and at an advanced stage. regards, dmitri.
Re: The Case Against Autodecode
On Tuesday, 31 May 2016 at 18:34:54 UTC, Jonathan M Davis wrote: On Tuesday, May 31, 2016 16:29:33 Joakim via Digitalmars-d wrote: UTF-8 is an antiquated hack that needs to be eradicated. It forces all other languages than English to be twice as long, for no good reason, have fun with that when you're downloading text on a 2G connection in the developing world. It is unnecessarily inefficient, which is precisely why auto-decoding is a problem. It is only a matter of time till UTF-8 is ditched. Considering that *nix land uses UTF-8 almost exclusively, and many C libraries do even on Windows, I very much doubt that UTF-8 is going anywhere anytime soon - if ever. The Win32 API does use UTF-16, and Java and C# do, but the vast sea of code that is C or C++ generally uses UTF-8, as do plenty of other programming languages. I agree that both UTF encodings are somewhat popular now. And even aside from English, most European languages are going to be more efficient with UTF-8, because they're still primarily ASCII even if they contain characters that are not. Stuff like Chinese is definitely worse in UTF-8 than it would be in UTF-16, but there are a lot of languages other than English which are going to encode better with UTF-8 than UTF-16 - let alone UTF-32. And there are a lot more languages that will be twice as long as English, i.e. ASCII. Regardless, UTF-8 isn't going anywhere anytime soon. _Way_ too much uses it for it to be going anywhere, and most folks have no problem with that. Any attempt to get rid of it would be a huge, uphill battle. I disagree, it is inevitable. Any tech so complex and inefficient cannot last long. But D supports UTF-8, UTF-16, _and_ UTF-32 natively - even without involving the standard library - so anyone who wants to avoid UTF-8 is free to do so. Yes, but not by using UTF-16/32, which use too much memory. I've suggested a single-byte encoding for most languages instead, both in my last post and the earlier thread.
D could use this new encoding internally, while keeping its current UTF-8/16 strings around for any outside UTF-8/16 data passed in. Any of that data run through algorithms that don't require decoding could be kept in UTF-8, but the moment any decoding is required, D would translate UTF-8 to the new encoding, which would be much easier for programmers to understand and manipulate. If UTF-8 output is needed, you'd have to encode back again. Yes, this translation layer would be a bit of a pain, but the new encoding would be so much more efficient and understandable that it would be worth it, and you're already decoding and encoding back to UTF-8 for those algorithms now. All that's changing is that you're using a new and different encoding than dchar as the default. If it succeeds for D, it could then be sold more widely as a replacement for UTF-8/16. I think this would be the right path forward, not navigating this UTF-8/16 mess further.
Re: The Case Against Autodecode
Am Tue, 31 May 2016 16:29:33 + schrieb Joakim: > Part of it is the complexity of written language, part of it is > bad technical decisions. Building the default string type in D > around the horrible UTF-8 encoding was a fundamental mistake, > both in terms of efficiency and complexity. I noted this in one > of my first threads in this forum, and as Andrei said at the > time, nobody agreed with me, with a lot of hand-waving about how > efficiency wasn't an issue or that UTF-8 arrays were fine. > Fast-forward years later and exactly the issues I raised are now > causing pain. Maybe you can dig up your old post and we can look at each of your complaints in detail. > UTF-8 is an antiquated hack that needs to be eradicated. It > forces all other languages than English to be twice as long, for > no good reason, have fun with that when you're downloading text > on a 2G connection in the developing world. It is unnecessarily > inefficient, which is precisely why auto-decoding is a problem. > It is only a matter of time till UTF-8 is ditched. You don't download twice the data. First of all, some languages had two-byte encodings before UTF-8, and second web content is full of HTML syntax and gzip compressed afterwards. Take this Thai Wikipedia entry for example: https://th.wikipedia.org/wiki/%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2 The download of the gzipped html is 11% larger in UTF-8 than in Thai TIS-620 single-byte encoding. And that is dwarfed by the size of JS + images. (I don't have the numbers, but I expect the effective overhead to be ~2%). Ironically a lot of symbols we take for granted would then have to be implemented as HTML entities using their Unicode code points(sic!). Amongst them basic stuff like dashes, degree (°) and minute (′), accents in names, non-breaking space or footnotes (↑). > D devs should lead the way in getting rid of the UTF-8 encoding, > not bickering about how to make it more palatable. 
I suggested a > single-byte encoding for most languages, with double-byte for the > ones which wouldn't fit in a byte. Use some kind of header or > other metadata to combine strings of different languages, _rather > than encoding the language into every character!_ That would have put D on an island. "Some kind of header" would be a horrible mess to have in strings, because you have to account for it when concatenating strings and scan for them all the time to see if there is some interspersed 2-byte encoding in the stream. That's hardly better than UTF-8. And yes, a huge amount of websites mix scripts and a lot of other text uses the available extra symbols like ° or α,β,γ. > The common string-handling use case, by far, is strings with only > one language, with a distant second some substrings in a second > language, yet here we are putting the overhead into every > character to allow inserting characters from an arbitrary > language! This is madness. No thx, madness was when we couldn't reliably open text files, because nowhere was the encoding stored and when you had to compile programs for each of a dozen codepages, so localized text would be rendered correctly. And your retro codepage system won't convince the world to drop Unicode either. > Yes, the complexity of diacritics and combining characters will > remain, but that is complexity that is inherent to the variety of > written language. UTF-8 is not: it is just a bad technical > decision, likely chosen for ASCII compatibility and some > misguided notion that being able to combine arbitrary language > strings with no other metadata was worthwhile. It is not. The web proves you wrong. Scripts do get mixed often. Be it Wikipedia, a foreign language learning site or mathematical symbols. -- Marco
Re: asm woes...
On Tuesday, 31 May 2016 at 18:52:16 UTC, Marco Leise wrote: The 'this' pointer is usually in some register already. On Linux 32-bit for example it is in EAX, on Linux 64-bit it is in RDI. The AX register seems like a bad choice, since you require the AX/DX registers when you do multiplication and division (although all other registers are general purpose, some instructions are still tied to specific registers). SI/DI are a much better choice. By the way, you are right that 32-bit does not have access to 64-bit machine words (actually kind of obvious), but your idea wasn't far-fetched, since there is the X32 architecture at least for Linux. It uses 64-bit machine words, but 32-bit pointers, and allows for compact and fast programs. As I recall, the switch to use the larger registers is a simple prefix per instruction, something like either 60h, 66h or 67h. I forget which one exactly, as I recall writing assembly programs for 16-bit DOS but using 32-bit registers using that trick (built into the assembler). Although to use the lower registers by themselves required the same switch, so...
Re: faster splitter
On Tuesday, 31 May 2016 at 19:59:50 UTC, qznc wrote: On Tuesday, 31 May 2016 at 19:29:25 UTC, Chris wrote: Would it speed things up even more, if we put the function `computeSkip` into the loop or is this done automatically by the compiler? LDC inlines it. DMD does not. More numbers: ./benchmark.ldc Search in Alice in Wonderland std: 147 ±1 manual: 100 ±0 qznc: 121 ±1 Chris: 103 ±1 Andrei: 144 ±1 Andrei2: 105 ±1 Search in random short strings std: 125 ±15 manual: 117 ±10 qznc: 104 ±6 Chris: 123 ±14 Andrei: 104 ±5 Andrei2: 103 ±4 Mismatch in random long strings std: 140 ±22 manual: 164 ±64 qznc: 115 ±13 Chris: 167 ±63 Andrei: 161 ±68 Andrei2: 106 ±9 Search random haystack with random needle std: 138 ±27 manual: 135 ±33 qznc: 116 ±16 Chris: 141 ±36 Andrei: 131 ±33 Andrei2: 109 ±12 (avg slowdown vs fastest; absolute deviation) CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz Random short strings has haystacks of 10 to 300 characters and needles of 2 to 10. Basically, no time for initialisation. Random long strings has haystacks of size 1000, 10_000, 100_000, or 1_000_000 and needles 50 to 500. It inserts a character into a random index of the needle to force a mismatch. The last one is the configuration as before. Overall, Andrei2 (the lazy compute skip) is really impressive. :) Yep. It's really impressive. I actually thought that dmd didn't place `computeSkip` inside of the loop. This raises the question of whether it should be moved into the loop, in case we use it in Phobos, to make sure that it is as fast as possible even with dmd. However, I like it the way it is now. The great thing about `Andrei2` is that it performs consistently well.
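For readers who missed the earlier thread, the idea behind the lazily computed skip can be sketched like this. It is a simplified Horspool-style search, not the actual benchmark candidate: the skip table is only built on the first mismatch, so a search that matches immediately never pays for it.

```d
// Simplified sketch of a "lazy skip" substring search:
// right-to-left window comparison, Horspool-style shifts,
// with the skip table built on demand at the first mismatch.
size_t find(const(char)[] haystack, const(char)[] needle)
{
    if (needle.length == 0) return 0;
    size_t[256] skip = void;
    bool skipReady = false;

    size_t i = 0;
    while (i + needle.length <= haystack.length)
    {
        size_t j = needle.length;
        while (j > 0 && haystack[i + j - 1] == needle[j - 1]) --j;
        if (j == 0) return i; // full match at i
        if (!skipReady) // pay the table cost only if we ever mismatch
        {
            skip[] = needle.length;
            foreach (k; 0 .. needle.length - 1)
                skip[cast(ubyte) needle[k]] = needle.length - 1 - k;
            skipReady = true;
        }
        // shift by the skip of the byte under the window's last slot
        i += skip[cast(ubyte) haystack[i + needle.length - 1]];
    }
    return size_t.max; // not found
}

void main()
{
    assert(find("hello world", "world") == 6);
    assert(find("abc", "zzz") == size_t.max);
}
```

This operates on bytes, which is exactly the substring case where, as noted elsewhere in the thread, no decoding is needed: UTF-8 substring matches cannot start or end mid-character.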
Re: Free the DMD backend
I have no idea how licensing would work in that regard, but considering that DMD's backend is actively maintained and may eventually even be ported to D, wouldn't it at some point differ enough from Symantec's "original" backend to simply call the DMD backend its own thing? Or are all the changes to the DMD backend simply changes to Symantec's backend, period? Then again, even if that'd legally be fine after some point, someone would have to make the judgement call, and that seems like a potentially large legal risk, so I guess even if it'd work that way it would be an unrealistic step.
Re: Free the DMD backend
On Tue, 2016-05-31 at 10:09 +, Atila Neves via Digitalmars-d wrote: > […] > > No, no, no, no. We had LDC be the default already on Arch Linux > for a while and it was a royal pain. I want to choose to use LDC > when and if I need performance. Otherwise, I want my projects to > compile as fast possible and be able to use all the shiny new > features. So write a new backend for DMD the licence of which allows DMD to be in Debian and Fedora. -- Russel. = Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel.win...@ekiga.net 41 Buckmaster Road m: +44 7770 465 077 xmpp: rus...@winder.org.uk London SW11 1EN, UK w: www.russel.org.uk skype: russel_winder
Getting the parameters and other attributes belonging to the function overload with the greatest number of arguments
I'd like to find the overload of some function with the most parameters and (in this specific case) to get their identifiers using e.g. ParameterIdentifierTuple. There have also been cases where I'd have liked to iterate over the result of Parameters!func for each overload of that function. Can this be done, and if so how?
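One possible approach, as a sketch: `__traits(getOverloads)` yields every overload as an alias, and `Parameters`/`ParameterIdentifierTuple` can then be applied to each. The struct `S` below is a hypothetical stand-in for wherever the overloads actually live, and `widest` is a made-up helper name:

```d
import std.meta : AliasSeq;
import std.traits : Parameters, ParameterIdentifierTuple;

struct S
{
    static void f(int a) {}
    static void f(int a, string b) {}
    static void f(double x, double y, double z) {}
}

alias overloads = AliasSeq!(__traits(getOverloads, S, "f"));

// index of the overload with the most parameters, via a CTFE scan
enum size_t widest = ()
{
    size_t maxLen = 0, maxIdx = 0;
    foreach (i, overload; overloads)
        if (Parameters!overload.length > maxLen)
        {
            maxLen = Parameters!overload.length;
            maxIdx = i;
        }
    return maxIdx;
}();

void main()
{
    alias w = overloads[widest];
    static assert(Parameters!w.length == 3);
    static assert([ParameterIdentifierTuple!w] == ["x", "y", "z"]);
    // the same foreach over `overloads` also covers the second use
    // case: iterating Parameters!overload for each overload in turn
}
```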
Re: The Case Against Autodecode
On Tuesday, May 31, 2016 22:47:56 Dmitry Olshansky via Digitalmars-d wrote: > On 31-May-2016 01:00, Walter Bright wrote: > > On 5/30/2016 11:25 AM, Adam D. Ruppe wrote: > >> I don't agree on changing those. Indexing and slicing a char[] is > >> really useful > >> and actually not hard to do correctly (at least with regard to > >> handling code > >> units). > > > > Yup. It isn't hard at all to use arrays of codeunits correctly. > > Ehm as long as all you care for is operating on substrings I'd say. > Working with individual character requires either decoding or clever > tricks like operating on encoded UTF directly. Yeah, but Phobos provides the tools to do that reasonably easily even when autodecoding isn't involved. Sure, it's slightly more tedious to call std.utf.decode or std.utf.encode yourself rather than letting autodecoding take care of it, but it's easy enough to do and allows you to control when it's done. And we have stuff like byChar!dchar or byGrapheme for the cases where you don't want to actually operate on arrays of code units. - Jonathan M Davis
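The explicit-control style described above might look like this with the Phobos range adapters (using `byCodeUnit`/`byDchar` from std.utf and `byGrapheme` from std.uni); the counts for this particular sample string are easy to verify by hand:

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

void main()
{
    string s = "año"; // 'ñ' is the precomposed code point U+00F1
    assert(s.byCodeUnit.walkLength == 4);  // UTF-8 code units (bytes)
    assert(s.byDchar.walkLength == 3);     // code points
    assert(s.byGrapheme.walkLength == 3);  // graphemes
}
```

Each adapter makes the chosen level explicit at the call site, which is the point: the programmer, not the library, decides whether code units, code points, or graphemes are the right granularity.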
Re: The Case Against Autodecode
On Tuesday, May 31, 2016 21:48:36 Timon Gehr via Digitalmars-d wrote: > On 31.05.2016 21:40, Wyatt wrote: > > On Tuesday, 31 May 2016 at 19:20:19 UTC, Timon Gehr wrote: > >> The 'length' of a character is not one in all contexts. > >> The following text takes six columns in my terminal: > >> > >> 日本語 > >> 123456 > > > > That's a property of your font and font rendering engine, not Unicode. > > Sure. Hence "context". If you are e.g. trying to manually underline some > text in console output, for example in a compiler error message, > counting the number of characters will not actually be what you want, > even though it works reliably for ASCII text. > > > (Also, it's probably not quite six columns; most fonts I've tested, 漢字 > > are rendered as something like 1.5 characters wide, assuming your > > terminal doesn't overlap them.) > > > > -Wyatt > > It's precisely six columns in my terminal (also in emacs and in gedit). > > My point was, how can std.algorithm ever guess correctly what you > /actually/ intended to do? It can't, which is precisely why having it select for you was a bad design decision. The programmer needs to be making that decision. And the fact that Phobos currently makes that decision for you means that it's often doing the wrong thing, and the fact that it chose to decode code points by default means that it's often eating up unnecessary cycles to boot. - Jonathan M Davis
Re: The Case Against Autodecode
On Tuesday, May 31, 2016 15:33:38 Andrei Alexandrescu via Digitalmars-d wrote: > On 05/31/2016 02:53 PM, Jonathan M Davis via Digitalmars-d wrote: > > walkLength treats a code point like it's a character. > > No, it treats a code point like it's a code point. -- Andrei Wasn't the whole point of operating at the code point level by default to make it so that code would be operating on full characters by default instead of chopping them up as is so easy to do when operating at the code unit level? Thanks to how Phobos treats strings as ranges of dchar, most D code treats code points as if they were characters. So, whether it's correct or not, a _lot_ of D code is treating walkLength like it returns the number of characters in a string. And if walkLength doesn't provide the number of characters in a string, why would I want to use it under normal circumstances? Why would I want to be operating at the code point level in my code? It's not necessarily a full character, since it's not necessarily a grapheme. So, by using walkLength and front and popFront and whatnot with strings, I'm not getting full characters. I'm still only getting pieces of characters - just like would happen if strings were treated as ranges of code units. I'm just getting bigger pieces of the characters out of the deal. But if they're not full characters, what's the point? I am sure that there is code that is going to want to operate at the code point level, but your average program is either operating on strings as a whole or individual characters. As long as strings are being operated on as a whole, code units are generally plenty, and careful encoding of characters into code units for comparisons means that much of the time that you want to operate on individual characters, you can still operate at the code unit level. But if you can't, then you need the grapheme level, because a code point is not necessarily a full character. 
So, what is the point of operating on code points in your average D program? walkLength will not always tell me the number of characters in a string. front risks giving me a partial character rather than a whole one. Slicing dchar[] risks chopping up characters just like slicing char[] does. Operating on code points by default does not result in correct string processing. I honestly don't see how autodecoding is defensible. We may not be able to get rid of it due to the breakage that doing that would cause, but I fail to see how it is at all desirable that we have autodecoded strings. I can understand how we got it if it's based on a misunderstanding on your part about how Unicode works. We all make mistakes. But I fail to see how autodecoding wasn't a mistake. It's the worst of both worlds - inefficient while still incorrect. At least operating at the code unit level would be fast while being incorrect, and it would be obviously incorrect once you did anything with non-ASCII values, whereas it's easy to miss that ranges of dchar are doing the wrong thing too - Jonathan M Davis
Re: faster splitter
On Tuesday, 31 May 2016 at 19:29:25 UTC, Chris wrote: Would it speed things up even more, if we put the function `computeSkip` into the loop or is this done automatically by the compiler? LDC inlines it. DMD does not. More numbers: ./benchmark.ldc Search in Alice in Wonderland std: 147 ±1 manual: 100 ±0 qznc: 121 ±1 Chris: 103 ±1 Andrei: 144 ±1 Andrei2: 105 ±1 Search in random short strings std: 125 ±15 manual: 117 ±10 qznc: 104 ±6 Chris: 123 ±14 Andrei: 104 ±5 Andrei2: 103 ±4 Mismatch in random long strings std: 140 ±22 manual: 164 ±64 qznc: 115 ±13 Chris: 167 ±63 Andrei: 161 ±68 Andrei2: 106 ±9 Search random haystack with random needle std: 138 ±27 manual: 135 ±33 qznc: 116 ±16 Chris: 141 ±36 Andrei: 131 ±33 Andrei2: 109 ±12 (avg slowdown vs fastest; absolute deviation) CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz Random short strings has haystacks of 10 to 300 characters and needles of 2 to 10. Basically, no time for initialisation. Random long strings has haystacks of size 1000, 10_000, 100_000, or 1_000_000 and needles 50 to 500. It inserts a character into a random index of the needle to force a mismatch. The last one is the configuration as before. Overall, Andrei2 (the lazy compute skip) is really impressive. :)
Re: The Case Against Autodecode
On 5/31/16 3:32 PM, H. S. Teoh via Digitalmars-d wrote: On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via Digitalmars-d wrote: [...] Does walkLength yield the same number for all representations? Let's put the question this way. Given the following string, what do *you* think walkLength should return? Compiler error. -Steve
Re: The Case Against Autodecode
On Tue, May 31, 2016 at 07:40:13PM +, Wyatt via Digitalmars-d wrote: > On Tuesday, 31 May 2016 at 19:20:19 UTC, Timon Gehr wrote: > > > > The 'length' of a character is not one in all contexts. > > The following text takes six columns in my terminal: > > > > 日本語 > > 123456 > > That's a property of your font and font rendering engine, not Unicode. > (Also, it's probably not quite six columns; most fonts I've tested, > 漢字 are rendered as something like 1.5 characters wide, assuming your > terminal doesn't overlap them.) [...] I believe he was talking about a console terminal that uses 2 columns to render the so-called "double width" characters. The CJK block does contain "double-width" versions of selected blocks (e.g., the ASCII block), to be used with said characters. Of course, using string length to measure string width is a risky venture fraught with pitfalls, because your terminal may not actually render them the way you think it should. Nevertheless, it does serve to highlight why a construct like s.walkLength is essentially buggy, because there is not enough information to determine which length it should return -- length of the buffer in bytes, or the number of code points, or the number of graphemes, or the width of the string. No matter which choice you make, it only works for a subset of cases and is wrong for the other cases. This is a prime illustration of why forcing autodecoding on every string in D is a wrong design. T -- Не дорог подарок, дорога любовь.
Re: The Case Against Autodecode
On 31.05.2016 21:40, Wyatt wrote: On Tuesday, 31 May 2016 at 19:20:19 UTC, Timon Gehr wrote: The 'length' of a character is not one in all contexts. The following text takes six columns in my terminal: 日本語 123456 That's a property of your font and font rendering engine, not Unicode. Sure. Hence "context". If you are e.g. trying to manually underline some text in console output, for example in a compiler error message, counting the number of characters will not actually be what you want, even though it works reliably for ASCII text. (Also, it's probably not quite six columns; most fonts I've tested, 漢字 are rendered as something like 1.5 characters wide, assuming your terminal doesn't overlap them.) -Wyatt It's precisely six columns in my terminal (also in emacs and in gedit). My point was, how can std.algorithm ever guess correctly what you /actually/ intended to do?
Re: The Case Against Autodecode
On 31-May-2016 01:00, Walter Bright wrote: On 5/30/2016 11:25 AM, Adam D. Ruppe wrote: I don't agree on changing those. Indexing and slicing a char[] is really useful and actually not hard to do correctly (at least with regard to handling code units). Yup. It isn't hard at all to use arrays of codeunits correctly. Ehm, as long as all you care for is operating on substrings, I'd say. Working with individual characters requires either decoding or clever tricks like operating on encoded UTF directly. -- Dmitry Olshansky
Re: The Case Against Autodecode
On Tuesday, May 31, 2016 21:20:19 Timon Gehr via Digitalmars-d wrote: > On 31.05.2016 20:53, Jonathan M Davis via Digitalmars-d wrote: > > On Tuesday, May 31, 2016 14:30:08 Andrei Alexandrescu via Digitalmars-d wrote: > >> >On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote: > >>> > >On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via > >>> > >Digitalmars-d > > > > wrote: > > >>On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote: > > > >>>Saying that operating at the code point level - UTF-32 - is > > > >>>correct > > > >>>is like saying that operating at UTF-16 instead of UTF-8 is > > > >>>correct. > > >> > > >>Could you please substantiate that? My understanding is that code > > >>unit > > >>is a higher-level Unicode notion independent of encoding, whereas > > >>code > > >>point is an encoding-dependent representation detail. -- Andrei > >> > > >> >Does walkLength yield the same number for all representations? > > > > walkLength treats a code point like it's a character. My point is that > > that's incorrect behavior. It will not result in correct string processing > > in the general case, because a code point is not guaranteed to be a > > full character. > > ... > > What's "correct"? Maybe the user intended to count the number of code > points in order to pre-allocate a dchar[] of the correct size. > > Generally, I don't see how algorithms become magically "incorrect" when > applied to utf code units. In the vast majority of cases what folks care about is full characters, which is not what code points are. But the fact that they want different things in different situation just highlights the fact that just converting everything to code points by default is a bad idea. And even worse, code points are usually the worst choice. Many operations don't require decoding and can be done at the code unit level, meaning that operating at the code point level is just plain inefficient. 
And the vast majority of the operations that can't operate at the code point level, then need to operate on full characters, which means that they need to be operating at the grapheme level. Code points are in this weird middle ground that's useful in some cases but usually isn't what you want or need. We need to be able to operate at the code unit level, the code point level, and the grapheme level. But defaulting to the code point level really makes no sense. > > walkLength does not report the length of a character as one in all cases > > just like length does not report the length of a character as one in all > > cases. walkLength is counting bigger units than length, but it's still > > counting pieces of a character rather than counting full characters. > > The 'length' of a character is not one in all contexts. > The following text takes six columns in my terminal: > > 日本語 > 123456 Well, that's getting into displaying characters which is a whole other can of worms, but it also highlights that assuming that the programmer wants a particular level of unicode is not a particularly good idea and that we should avoid converting for them without being asked, since it risks being inefficient to no benefit. - Jonathan M Davis
Re: The Case Against Autodecode
On Tuesday, 31 May 2016 at 19:20:19 UTC, Timon Gehr wrote: The 'length' of a character is not one in all contexts. The following text takes six columns in my terminal: 日本語 123456 That's a property of your font and font rendering engine, not Unicode. (Also, it's probably not quite six columns; most fonts I've tested, 漢字 are rendered as something like 1.5 characters wide, assuming your terminal doesn't overlap them.) -Wyatt
[Issue 16090] popFront generates out-of-bounds array index on corrupted utf-8 strings
https://issues.dlang.org/show_bug.cgi?id=16090 --- Comment #2 from github-bugzi...@puremagic.com --- Commits pushed to master at https://github.com/dlang/phobos https://github.com/dlang/phobos/commit/e1af1b0b51ea9f29d4ff8076d73c03ba10bfc73c fix issue 16090 - popFront generates out-of-bounds array index on corrupted utf-8 strings https://github.com/dlang/phobos/commit/279ccd7c5c8cebfb21a3138aecf7f3a85444e538 Merge pull request #4387 from aG0aep6G/16090 fix issue 16090 - popFront generates out-of-bounds array index on cor… --
Re: The Case Against Autodecode
On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via Digitalmars-d wrote: [...] > Does walkLength yield the same number for all representations? Let's put the question this way. Given the following string, what do *you* think walkLength should return? şŭt̥ḛ́k̠ I think any reasonable person would have to say it should return 5, because there are 5 visual "characters" here. Otherwise, what is even the meaning of walkLength?! For it to return anything other than 5 means that it's a leaky abstraction, because it's leaking low-level "implementation details" of the Unicode representation of this string. However, with the current implementation of autodecoding, walkLength returns 11. Can anyone reasonably argue that it's reasonable for "şŭt̥ḛ́k̠".walkLength to equal 11? What difference does this make if we get rid of autodecoding, and walkLength returns 17 instead? *Both* are wrong. 17 is actually the right answer if you're looking to allocate a buffer large enough to hold this string, because that's the number of bytes it occupies. 5 is the right answer to an end user who knows nothing about Unicode. 11 is an answer to a question that only makes sense to a Unicode specialist, and that no layperson understands. 11 is the answer we currently give. And that, at the cost of across-the-board performance degradation. Yet you're seriously arguing that 11 should be the right answer, by insisting that the current implementation of autodecoding is "correct". It boggles the mind. T -- Today's society is one of specialization: as you grow, you learn more and more about less and less. Eventually, you know everything about nothing.
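The three counts can be checked directly. The literal below is a reconstruction of a fully decomposed spelling of the example (five base letters, each followed by one or two combining marks), so treat the exact escape sequence as an assumption; the 17/11/5 split matches the numbers above:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // s+cedilla, u+breve, t+ring below, e+tilde below+acute, k+minus below
    string s = "s\u0327u\u0306t\u0325e\u0330\u0301k\u0320";
    assert(s.length == 17);               // UTF-8 code units (bytes)
    assert(s.walkLength == 11);           // code points (what autodecoding counts)
    assert(s.byGrapheme.walkLength == 5); // user-perceived characters
}
```

Each combining mark is one extra code point and two extra bytes, which is exactly why the byte count, the code-point count, and the grapheme count all diverge here.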
Re: The Case Against Autodecode
On 05/31/2016 07:21 PM, Andrei Alexandrescu wrote: Could you please substantiate that? My understanding is that code unit is a higher-level Unicode notion independent of encoding, whereas code point is an encoding-dependent representation detail. -- Andrei You got the terms mixed up. Code unit is lower level. Code point is higher level. One code point is encoded with one or more code units. char is a UTF-8 code unit. wchar is a UTF-16 code unit. dchar is both a UTF-32 code unit and a code point, because in UTF-32 it's a 1-to-1 relation.
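The relation is easy to demonstrate with a precomposed 'é' (U+00E9): the same single code point takes a different number of code units depending on the encoding. A quick sketch:

```d
void main()
{
    string  u8  = "\u00E9"; // é, one code point
    wstring u16 = "\u00E9";
    dstring u32 = "\u00E9";

    // .length counts code units, not code points or characters.
    assert(u8.length  == 2); // two UTF-8 code units (char)
    assert(u16.length == 1); // one UTF-16 code unit (wchar)
    assert(u32.length == 1); // one UTF-32 code unit == one code point (dchar)
}
```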
Re: The Case Against Autodecode
On 05/31/2016 02:57 PM, Jonathan M Davis via Digitalmars-d wrote: In addition, as soon as you have ubyte[], none of the string-related functions work. That's fixable, but as it stands, operating on ubyte[] instead of char[] is a royal pain. That'd be nice to fix indeed. Please break the ground? -- Andrei
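A sketch of the status quo being complained about (not a proposed fix): Phobos string functions are constrained to character arrays, so today working with ubyte[] means casting back to char[] and promising the data is valid UTF-8.

```d
import std.algorithm.searching : canFind;
import std.string : toUpper;

void main()
{
    ubyte[] raw = cast(ubyte[]) "hello world".dup;

    // String functions reject ubyte[] outright:
    static assert(!__traits(compiles, raw.toUpper));

    // The workaround: cast to char[], asserting the bytes are valid UTF-8.
    auto text = cast(char[]) raw;
    assert(text.canFind("world"));
    assert(text.toUpper == "HELLO WORLD");
}
```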
Re: The Case Against Autodecode
On 05/31/2016 02:53 PM, Jonathan M Davis via Digitalmars-d wrote: walkLength treats a code point like it's a character. No, it treats a code point like it's a code point. -- Andrei
Re: The Case Against Autodecode
On 05/31/2016 02:46 PM, Timon Gehr wrote: On 31.05.2016 20:30, Andrei Alexandrescu wrote: D's Phobos' foreach, too. -- Andrei
Re: Reddit announcements
On Tuesday, 31 May 2016 at 18:57:29 UTC, o-genki-desu-ka wrote: Many nice announcements here last week. I put some on reddit. https://www.reddit.com/r/programming/comments/4lwufi/d_embedded_database_v01_released/ https://www.reddit.com/r/programming/comments/4lwubv/c_to_d_converter_based_on_clang/ https://www.reddit.com/r/programming/comments/4lwu5p/coedit_2_ide_update_6_released/ https://www.reddit.com/r/programming/comments/4lwtxw/compiletime_sqlite_for_d_beta_release/ https://www.reddit.com/r/programming/comments/4lwtr0/button_a_fast_correct_and_elegantly_simple_build/ https://www.reddit.com/r/programming/comments/4lwtn9/first_release_of_powernex_an_os_kernel_written_in/ I'm a bit concerned that people will react negatively to them all being dumped at once.
Re: faster splitter
On Tuesday, 31 May 2016 at 18:56:14 UTC, qznc wrote: The mistake is to split on "," instead of ','. The slow splitter at the start of this thread searches for "\r\n". Your lazy-skip algorithm looks great!

./benchmark.ldc
Search in Alice in Wonderland
     std: 168 ±6   +29 ( 107)  -3 ( 893)
  manual: 112 ±3   +28 (  81)  -1 ( 856)
    qznc: 149 ±4   +30 (  79)  -1 ( 898)
   Chris: 142 ±5   +28 ( 102)  -2 ( 898)
  Andrei: 153 ±3   +28 (  79)  -1 ( 919)
 Andrei2: 101 ±2   +34 (  31)  -1 ( 969)
Search random haystack with random needle
     std: 172 ±19  +61 ( 161) -11 ( 825)
  manual: 161 ±47  +73 ( 333) -35 ( 666)
    qznc: 163 ±21  +33 ( 314) -15 ( 661)
   Chris: 190 ±47  +80 ( 302) -33 ( 693)
  Andrei: 140 ±37  +60 ( 315) -27 ( 676)
 Andrei2: 103 ±6   +57 (  64)  -2 ( 935)
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7 CPU M 620 @ 2.67GHz

The Alice benchmark searches Alice in Wonderland for "find a pleasure in all their simple joys" and finds it in the last sentence. Would it speed things up even more if we put the function `computeSkip` into the loop, or is this done automatically by the compiler?
Re: The Case Against Autodecode
On 31.05.2016 20:53, Jonathan M Davis via Digitalmars-d wrote:
> On Tuesday, May 31, 2016 14:30:08 Andrei Alexandrescu via Digitalmars-d wrote:
>> On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:
>>> On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d wrote:
>>>> On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
>>>>> Saying that operating at the code point level - UTF-32 - is correct is like saying that operating at UTF-16 instead of UTF-8 is correct.
>>>> Could you please substantiate that? My understanding is that code unit is a higher-level Unicode notion independent of encoding, whereas code point is an encoding-dependent representation detail. -- Andrei
>> Does walkLength yield the same number for all representations?
> walkLength treats a code point like it's a character. My point is that that's incorrect behavior. It will not result in correct string processing in the general case, because a code point is not guaranteed to be a full character.
> ...

What's "correct"? Maybe the user intended to count the number of code points in order to pre-allocate a dchar[] of the correct size. Generally, I don't see how algorithms become magically "incorrect" when applied to UTF code units.

walkLength does not report the length of a character as one in all cases, just like length does not report the length of a character as one in all cases. walkLength is counting bigger units than length, but it's still counting pieces of a character rather than counting full characters. The 'length' of a character is not one in all contexts. The following text takes six columns in my terminal:

日本語
123456
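The pre-allocation point in concrete form, as a sketch: the code-point count that walkLength reports on a string is exactly the length needed for a UTF-32 (dchar[]) copy of it.

```d
import std.range : walkLength;
import std.utf : decode;

void main()
{
    string s = "日本語";

    // walkLength over a string counts code points (via autodecoding),
    // which is precisely the buffer size a dchar[] copy needs.
    auto buf = new dchar[](s.walkLength);

    size_t i = 0, pos = 0;
    while (pos < s.length)
        buf[i++] = decode(s, pos); // decode advances pos by one code point

    assert(buf.length == 3);
    assert(buf == "日本語"d);
}
```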
Re: Variables should have the ability to be @nogc
On Tue, 31 May 2016 15:53:44 +, Basile B. wrote:
> This solution seems smarter than using the existing '@nogc' attribute. Plus one also for the fact that nothing has to be done in DMD.

I just constrained myself to what can be done in user code from the start. :)

> Did you encounter the issue with protected and private members?
>
> For me when I've tested the template I've directly got some warnings. DMD interprets my 'getMember' calls as a deprecated abuse of bug 314 but in dmd 2.069 I would get true errors.

Actually it is in a large half-ported code base from C++ and I haven't ever had a running executable, nor did I test it with recent dmd versions. My idea was to mostly have @nogc code, but allow it for a transition time or places where GC use does not have an impact. Here is the code, free to use for all purposes.

enum GcScan { no, yes, automatic }
enum noScan = GcScan.no;

template gcScanOf(T)
{
    import std.typetuple;

    static if (is(T == struct) || is(T == union))
    {
        enum isGcScan(alias uda) = is(typeof(uda) == GcScan);

        GcScan findGcScan(List...)()
        {
            auto result = GcScan.automatic;
            foreach (attr; List)
                if (is(typeof(attr) == GcScan))
                    result = attr;
            return result;
        }

        enum gcScanOf()
        {
            auto result = GcScan.no;
            foreach (i; Iota!(T.tupleof.length))
            {
                enum memberGcScan = findMatchingUda!(T.tupleof[i], isGcScan, true);
                static if (memberGcScan.length == 0)
                    enum eval = gcScanOf!(typeof(T.tupleof[i]));
                else
                    enum eval = evalGcScan!(memberGcScan, typeof(T.tupleof[i]));
                static if (eval)
                {
                    result = eval;
                    break;
                }
            }
            return result;
        }
    }
    else
    {
        static if (isStaticArray!T && is(T : E[N], E, size_t N))
            enum gcScanOf = is(E == void) ? GcScan.yes : gcScanOf!E;
        else
            enum gcScanOf = hasIndirections!T ? GcScan.yes : GcScan.no;
    }
}

enum evalGcScan(GcScan gc, T) = (gc == GcScan.automatic) ? gcScanOf!T : gc;

template findMatchingUda(alias symbol, alias func, bool optional = false, bool multiple = false)
{
    import std.typetuple;

    enum symbolName = __traits(identifier, symbol);
    enum funcName = __traits(identifier, func);

    template Filter(List...)
    {
        static if (List.length == 0)
            alias Filter = TypeTuple!();
        else static if (__traits(compiles, func!(List[0])) && func!(List[0]))
            alias Filter = TypeTuple!(List[0], Filter!(List[1 .. $]));
        else
            alias Filter = Filter!(List[1 .. $]);
    }

    alias filtered = Filter!(__traits(getAttributes, symbol));

    static assert(filtered.length <= 1 || multiple,
        symbolName ~ " may only have one UDA matching " ~ funcName ~ ".");
    static assert(filtered.length >= 1 || optional,
        symbolName ~ " requires a UDA matching " ~ funcName ~ ".");

    static if (multiple || optional)
        alias findMatchingUda = filtered;
    else static if (filtered.length == 1)
        alias findMatchingUda = filtered[0];
}

-- Marco
Is there any overhead iterating over a pointer using a slice?
In relation to this thread: http://forum.dlang.org/thread/ddckhvcxlyuvuiyaz...@forum.dlang.org

Where I asked about slicing a pointer, I have another question: If I have a pointer and iterate over it using a slice, like this:

T* foo = ...;

foreach (element; foo[0 .. length])
{
    ...
}

Is there any overhead compared with pointer arithmetic in a for loop?
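For reference, a minimal sketch of the two loop forms being compared (function names are hypothetical). Whether they compile to identical code is exactly the question being asked; the sketch only shows that they compute the same thing:

```d
int sumSlice(int* p, size_t length)
{
    int total = 0;
    foreach (element; p[0 .. length]) // iterate the pointer via a slice
        total += element;
    return total;
}

int sumPointer(int* p, size_t length)
{
    int total = 0;
    for (auto q = p; q < p + length; ++q) // manual pointer arithmetic
        total += *q;
    return total;
}

void main()
{
    int[4] data = [1, 2, 3, 4];
    assert(sumSlice(data.ptr, data.length) == 10);
    assert(sumPointer(data.ptr, data.length) == 10);
}
```

Note the slice form is also subject to bounds checking unless it is compiled out (e.g. with -release/-boundscheck=off), which is one potential source of overhead.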