On 09.03.2012 1:12, Jonathan M Davis wrote:
On Friday, March 09, 2012 00:54:48 Dmitry Olshansky wrote:
On 08.03.2012 22:46, Jonathan M Davis wrote:
On Thursday, March 08, 2012 22:03:12 Dmitry Olshansky wrote:
On 08.03.2012 11:48, Jonathan M Davis wrote:
A range is not necessarily a dynamic array, though a dynamic array is a
range. The lexer is going to need to take a range of dchar (which may or
may not be an array), and it's probably going to need to return a range
of tokens. The parser would then take a range of tokens and then output
the AST in some form or other - it probably couldn't be a range, but I'm
not sure. And while the lexer would need to operate on generic ranges of
dchar, it would probably have to be special-cased for strings in a
number
of places in order to make it faster (e.g. checking the first char in a
string rather than using front when it's known that the value being
checked against is an ASCII character and will therefore fit in a single
char - front has to decode the next character, which is less efficient).

Simply put, the decision on decoding should belong to the lexer. Thus
strings should be wrapped as input ranges of char, wchar&  dchar
respectively.

??? The normal way to handle this is to simply special-case certain
operations. e.g.

static if(is(Unqual!(ElementEncodingType!R) == char))
{ ... }

Is ElementEncodingType aware of anything other than string/wstring?

No. It uses isNarrowString. So, what you'd end up doing is having the lexer
accept a generic range of dchar and then have specializations where
appropriate for narrow strings. Nothing in Phobos' string handling really
supports the idea of dealing with generic char or wchar ranges. It all
processes ranges of dchar and specializes on narrow strings where appropriate.
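The pattern described above - a generic dchar-range overload plus a narrow-string specialization - might look like this minimal sketch (the function name is illustrative, not from Phobos):

```d
import std.range : ElementType, isInputRange;
import std.traits : isNarrowString, Unqual;

// Generic path: any input range of dchar.
bool startsWithLBrace(R)(R r)
    if (isInputRange!R && is(Unqual!(ElementType!R) == dchar)
        && !isNarrowString!R)
{
    return !r.empty && r.front == '{';
}

// Narrow-string specialization: '{' is ASCII, hence a single code
// unit, so we compare the first code unit directly - no decoding.
bool startsWithLBrace(R)(R r)
    if (isNarrowString!R)
{
    return r.length != 0 && r[0] == '{';
}
```

The `!isNarrowString!R` clause on the generic overload is needed because `ElementType!string` is dchar, so without it both constraints would match narrow strings and the call would be ambiguous.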


The concept that *only* strings need special casing is broken. I recall somebody already stomped on this: any range that returns char (a wrapper range over, say, a memory-mapped file) bypasses all the intricate special casing that does decoding in std.algorithm and friends. So, to put it simply, there is no way to tap into this *decoding by default* infrastructure.

But is there really a use case for a generic range of char or wchar? I don't
know. In general, I really don't think that there is.

A memory-mapped file wrapper is one, just off the top of my head. There could be plenty of them; one only needs to look. 'I don't know' is not a solid argument, sorry.

When dealing with ranges
of characters, they're essentially always either strings or strings which have
been wrapped in other ranges (generally by calling some other range-based
function such as take or filter). And those range-based functions pretty much
inevitably need to treat the strings as ranges of dchar to do what they do
(potentially with specific optimizations for strings). They aren't designed to
operate on ranges of char or wchar, and the only reason to make them do so
would be if there were a use case where you needed a range of char or wchar
which was not an array. But they're all either arrays or wrapped arrays. So,
no such use case currently exists with Phobos, and I really question the
usefulness of trying to optimize on a generic range of char or wchar -
especially when many of the optimizations for arrays involve random access
ranges, and if you have a random access range of char or wchar, I _really_
question that it would ever be anything other than an array.


And that is the problem: if you fail to see why we need to stop pretending that all char/wchar ranges are arrays, then I have failed somewhere. In the same vein, one could argue that there is no random access range other than an array, yet there is. For instance, if we explore containers, there could be tries, deques and whatnot of char/wchar with their respective ranges.
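For concreteness, the memory-mapped file case mentioned above could be a range of char that is not an array, so none of Phobos' narrow-string special cases would apply to it. A hypothetical sketch (names illustrative; a real version might wrap std.mmfile.MmFile):

```d
// An input range of char over a mapped window of a file. It yields
// char, not dchar, and it is not an array - so it falls through all
// isNarrowString-based special casing.
struct MmFileRange
{
    private const(ubyte)[] view; // the mapped window
    private size_t idx;

    @property bool empty() const { return idx == view.length; }
    @property char front() const { return cast(char) view[idx]; }
    void popFront() { ++idx; }
}
```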

So, I'd advise against trying to operate on ranges of char or wchar and just
stick to operating on ranges of dchar with optimizations for narrow strings
where appropriate.

Probably the fact that in the lexer it's not 'some place for optimizations', it's the whole thing, was missed. That's why I'm looking for a more or less generic yet efficient way.


Now, speaking outside of this specific problem.
Basically I would propose formalizing the kind of range that current
string/wstring is. And that is a VariableLengthEncoding range (VLE
range), a two-in-one: a random access code unit range and a
bidirectional 'code point' range. I've heard of attempts at this
concept before, but now, with a use case at hand, it could become more
important.
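One possible shape for such a concept - purely a sketch, none of these names exist in Phobos:

```d
// A VLE range decodes elements on front/popFront, but also exposes
// random access to the underlying code units (hypothetical API).
enum bool isVLERange(R) = is(typeof((R r)
{
    auto cp = r.front;          // decoded element, e.g. a dchar
    r.popFront();               // advance by one decoded element
    auto cu = r.codeUnits[0];   // random access to raw code units
    auto n  = r.codeUnits.length;
}));
```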

There has been some talk of VLE ranges but really only with regards to the
fact that strings are a case of that. Nothing has been done to generalize it.
It may be something that should be looked into, but until we've formalized
that, I don't think that something like the lexer should be built around the
idea. It would be safer to stick with what we've been doing - operating on
ranges of dchar and special-casing for narrow strings where appropriate. If
need be, it should be possible to adjust it to use VLEs later.

The problem is, I think, that the current InputRange is insufficient, as
it requires calculating the length of the first element twice: once in
front and again in popFront.
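The duplicated work can be seen in terms of std.utf's decode and stride - front must find the sequence length to decode, then popFront re-derives that same length from the lead code unit (illustrative sketch):

```d
import std.utf : decode, stride;

void illustrate(string s)
{
    size_t i = 0;
    dchar c = decode(s, i);    // front: finds the sequence length and decodes
    // i now holds that length, but a plain input range cannot pass it
    // on to popFront, which must re-derive it from the lead code unit:
    s = s[stride(s, 0) .. $];  // popFront: examines the lead byte again
}
```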

Perhaps an optional function should be added to input ranges which both
returns front and pops it. But VLE ranges such as narrow strings are
probably the only ones which would really care, and we can already special-
case strings, so it would probably only be of real value for general VLEs.

The goal is to make std.algorithm general when it comes to UTF-x ranges, and a VLE range seems the best-suited abstraction level so far. Other things, like base64-encoded data, could fit there too, though it needs some thought.

However, since VLEs would arguably be a new range type, they could just
implement the new function, and anything which special-cases VLEs can take
advantage of it, while they still work just fine as input ranges. We could even
add a free function which both pops and returns front which uses front and
popFront for most ranges and the new function for VLEs, so you can code with
that free function in cases where you intend to both take front and pop it off
without caring about what type of range you're dealing with and without
forcing ranges in general to implement an extra function.
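Such a free function might be sketched as follows (the name consumeFront is illustrative; no such function exists in Phobos):

```d
import std.range : isInputRange;

// Return front and pop it in one step. Ranges that can do better
// (e.g. a VLE range that decodes exactly once) define their own
// member; everything else falls back to front + popFront.
auto consumeFront(R)(ref R r)
    if (isInputRange!R)
{
    static if (is(typeof(r.consumeFront())))
        return r.consumeFront();  // single decode on a VLE range
    else
    {
        auto e = r.front;
        r.popFront();
        return e;
    }
}
```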

There's definitely more work and designing to be done here, but there are
definitely possibilities.

Still, in the interim, I'd advise simply special casing on narrow strings in
the lexer rather than trying to use VLEs initially.


Aye, and write 2 copies of the lexer, yikes. I would just ignore *default* array ranges for the moment and do whatever is convenient & fast, then convince people to provide a proper abstraction. The problem with *default* ranges is that they have to do extra work to maintain their invariants; e.g. even an array has to adjust its length. As the OP mentioned, popFront is --length, ++ptr; whereas ptr++ is all that's required. So clearly a wrapper over an array that doesn't need to adjust a length could be faster. A la:
struct Wrapped(E){
private: // not visible
        E[] arr;
        size_t idx;
public:
        @property bool empty() const { return idx == arr.length; }
        @property auto ref front(){ return arr[idx]; }
        void popFront(){ ++idx; } // no length to adjust
}
Or even 2 pointer (current+end) version.
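The two-pointer variant mentioned above could look like this sketch (pointer arithmetic, so @system; names illustrative):

```d
// Keeps only current and end pointers - popFront is a single
// increment, with no length field to maintain.
struct PtrRange(E)
{
    private E* cur, end;

    this(E[] arr) { cur = arr.ptr; end = arr.ptr + arr.length; }

    @property bool empty() const { return cur == end; }
    @property ref E front() { return *cur; }
    void popFront() { ++cur; }
}
```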

--
Dmitry Olshansky
