Re: VLERange: a range in between BidirectionalRange and RandomAccessRange

Michel Fortin Tue, 18 Jan 2011 09:15:50 -0800

On 2011-01-18 11:38:45 -0500, Andrei Alexandrescu<seewebsiteforem...@erdani.org> said:

On 1/18/11 7:17 AM, Michel Fortin wrote:

On 2011-01-18 01:16:13 -0500, Andrei Alexandrescu
<seewebsiteforem...@erdani.org> said:

On 1/17/11 9:48 PM, Michel Fortin wrote:

On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.for...@michelf.com>
said:

More seriously, you have four choice:

1. code unit
2. code point
3. grapheme
4. require the client to state explicitly which kind of 'character' he
wants; 'character' being an overloaded word, it's reasonable to ask
for disambiguation.


This makes me think of what I did with my XML parser after you made code
points the element type for strings. Basically, the parser now uses
'front' and 'popFront' whenever it needs to get the next code point, but
most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I
had to add) when testing for or skipping an ASCII character is
sufficient. This way I avoid a lot of unnecessary decoding of code
points.

For this to work, the same range must let you skip either a unit or a
code point. If I were using a separate range with a call to toDchar or
toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't
have helped much because the new range would essentially become a new
slice independent of the original, so you can't interleave "I want to
advance by one unit" with "I want to advance by one code point".

So perhaps the best interface for strings would be to provide multiple
range-like interfaces that you can use at the level you want.

I'm not sure if this is a good idea, but I thought I should at least
share my experience.


Very insightful. Thanks for sharing. Code it up and make a solid
proposal!


What I use right now is this (see below). I'm not sure what would be a
good name for it though. The expectation is that I'll get either an
ASCII char or something out of ASCII range if it isn't ASCII.

The abstraction doesn't seem very 'solid' to me, in the sense that I
can't see how it'd apply to ranges other than strings, so it's only
useful for strings (the character array kind), and it's only useful as a
workaround since you made ElementType!(char[]) a dchar. Well, any range
returning char,dchar,wchar could map frontUnit to front and popFrontUnit
to popFront to keep things working, but it makes the optimization rather
pointless. I don't really have an idea where to go from here.

[snip]

I was thinking along the lines of:

struct Grapheme
{
     private string support_;
     ...
}

struct ByGrapheme
{
     private string iteratee_;
     bool empty();
     Grapheme front();
     void popFront();
     // Additional funs
     dchar frontCodePoint();
     void popFrontCodePoint();
     char frontCodeUnit();
     void popFrontCodeUnit();
     ...
}

// helper function
ByGrapheme byGrapheme(string s);

// usage
string s = ...;
size_t i;
foreach (g; byGrapheme(s))
{
     writeln("Grapheme #", i, " is ", g);
}

We need this range in Phobos.


Yes, we need a grapheme range.

But that's not what my thing was about. It was about shortcutting codepoint decoding when it isn't necessary while still keeping the abilityto decode to code points when iterating on the same range. Forinstance, here's a simple made up example:


        string s = "<hello>";
        if (!s.empty && s.frontUnit == '<')
                s.popFrontUnit(); // skip
        while (!s.empty && s.frontUnit != '>')
                s.popFront(); // do something with each code point
        if (!s.empty && s.frontUnit == '>')
                s.popFrontUnit(); // skip
        assert(s.empty);

Here, since I know I'm testing and skipping for '<', an ASCIIcharacter, decoding the code point is wasted time, so I skip thatdecoding. The problem is that this optimization can't happen with arange that abstracts things at the code point level. I can do it withstrings because strings still allow you to access code units throughthe indexing operators, but this can't really apply to ranges of codepoints in general.

And parsing with range of code unit would also be a pain, because evenif I'm testing for '<' for the first character, sometimes I really needto advance by code point and test for code points.

One thing that might be interesting is benchmarking my XML parser byreplacing every instance of frontUnit and popFrontUnit with front andpopFront. That won't change there results, but it'd give us an idea ofthe overhead of the unnecessary decoded characters code points.



--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/

Re: VLERange: a range in between BidirectionalRange and RandomAccessRange

Reply via email to