On 2011-01-15 09:09:17 -0500, foobar <f...@bar.com> said:

Lutger Blijdestijn Wrote:

Michel Fortin wrote:

On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn
<lutger.blijdest...@gmail.com> said:
...

Is it still possible to solve this problem or are we stuck with
specialized string algorithms? Would it work if VleRange of string was a
bidirectional range with string slices of graphemes as the ElementType
and indexing with code units? Often used string algorithms could be
specialized for performance, but if not, generic algorithms would still
work.

I have my idea.

I think it'd be a good idea to improve upon Andrei's first idea --
which was to treat char[], wchar[], and dchar[] all as ranges of dchar
elements -- by changing the element type to be the same as the string.
For instance, iterating on a char[] would give you slices of char[],
each having one grapheme.

...

Yes, this is exactly what I meant, but you are much clearer. I hope this can
be made to work!


My two cents are against this kind of design.
The "correct" approach IMO is a 'universal text' type which is a _container_ of said text. This type would provide ranges for the various abstraction levels. E.g.
text.codeUnits to iterate by codeUnits

Nothing prevents that in the design I proposed. Andrei's design already implements "str".byDchar(), which would work for code points. I'd suggest changing the API to by!char(), by!wchar(), and by!dchar() for when you want to deal with a particular kind of code unit or with code points. This would be mostly symmetric to what you can already do with foreach:

        foreach (char c; "hello") {}
        foreach (wchar c; "hello") {}
        foreach (dchar c; "hello") {}
        // same as:
        foreach (c; "hello".by!char()) {}
        foreach (c; "hello".by!wchar()) {}
        foreach (c; "hello".by!dchar()) {}
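
As a rough illustration of the three iteration levels, here is a minimal sketch. It assumes a Phobos release that provides std.utf.byChar, byWchar and byDchar; these are not the by!T API proposed above, only the closest existing analogue. The helper names are made up for the example.

```d
import std.range : walkLength;
import std.utf : byChar, byWchar, byDchar;

// Hypothetical helpers: count the elements seen when the same string is
// iterated as UTF-8 code units, UTF-16 code units, or code points.
size_t utf8Units(string s)  { return s.byChar.walkLength; }
size_t utf16Units(string s) { return s.byWchar.walkLength; }
size_t codePoints(string s) { return s.byDchar.walkLength; }
```

The same text gives different lengths at each level: a precomposed "é" is two UTF-8 code units but one UTF-16 code unit and one code point, and a character outside the BMP is four, two and one respectively.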


Here's a (perhaps contrived) example:
Let's say I want to find the combining marks in some text.

For instance, Hebrew uses combining marks for vowels (among other things) and they are optional in the language (there's a "full" form with vowels and a "missing" form without them). I have a Hebrew text in the "full" form and I want to strip the marks and convert it to the "missing" form.

How would I accomplish this with your design?

All you need is a range that takes a string as input and gives you code points in decomposed form (NFD); then you use std.algorithm.filter on it:

        // original string
        auto str = "...";

        // create normalized decomposed string as a lazy range of dchar (NFD)
        auto decomposed = decompose(str);

        // filter to remove your favorite combining code point (use the hex code you want)
        auto filtered = filter!"a != 0xFABA"(decomposed);

        // turn it back in composed form (NFC), optional
        auto recomposed = compose(filtered);

        // convert back to a string (could also be wstring or dstring)
        string result = array(recomposed.by!char());

This last line is the one doing everything. All the rest just chains ranges together to do on-the-fly decomposition, filtering, and recomposition; the last line uses that chain of ranges to fill the array.
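
For comparison, here is a sketch of the same decompose / filter / recompose pipeline that should compile against a Phobos release shipping std.uni.normalize and std.uni.combiningClass. It is eager rather than lazy, so it is not the decompose()/compose() range API described above. The function name stripCombiningMarks is hypothetical, and "canonical combining class != 0" is only an approximation of "combining mark": it covers the Hebrew points, but some marks have class 0.

```d
import std.algorithm.iteration : filter;
import std.array : array;
import std.conv : to;
import std.uni : NFC, NFD, combiningClass, normalize;

// Hypothetical helper: drop every code point with a nonzero canonical
// combining class, then recompose whatever is left.
string stripCombiningMarks(string text)
{
    // Decompose first so precomposed characters expose their marks (NFD).
    auto decomposed = normalize!NFD(text);

    // Iterating a string yields code points; keep only the starters.
    auto kept = decomposed.filter!(c => combiningClass(c) == 0).array;

    // Transcode back to UTF-8 and return the composed form (NFC).
    return normalize!NFC(kept.to!string);
}
```

With this, a pointed word such as "\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD" comes back as the bare letters "\u05E9\u05DC\u05D5\u05DD". It also turns "é" into "e", which may or may not be what you want outside Hebrew.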

A more naive implementation not taking advantage of code points but instead using a replacement table would also work:

        string str = "...";
        string result;
        string[string] replacements = ["é":"e"]; // change this for what you want
        foreach (grapheme; str) {
                auto replacement = grapheme in replacements;
                if (replacement)
                        result ~= *replacement; // "in" yields a pointer to the value
                else
                        result ~= grapheme;
        }
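
The loop above relies on foreach over a string yielding grapheme slices, which is the proposed behaviour rather than something the language does. A minimal sketch of the same replacement-table idea, assuming a Phobos release that provides std.uni.graphemeStride, could look like this; replaceGraphemes is a made-up name.

```d
import std.uni : graphemeStride;

// Hypothetical helper: walk `text` one grapheme at a time, taking each
// grapheme as a slice of the original string, and substitute any
// grapheme that has an entry in `table`.
string replaceGraphemes(string text, string[string] table)
{
    string result;
    for (size_t i = 0; i < text.length; )
    {
        // Number of code units in the grapheme starting at index i.
        immutable n = graphemeStride(text, i);
        string grapheme = text[i .. i + n];

        if (auto replacement = grapheme in table)
            result ~= *replacement;
        else
            result ~= grapheme;

        i += n;
    }
    return result;
}
```

Note that the keys are compared as raw code units, so a precomposed "é" and a decomposed "e" followed by U+0301 are different keys unless the text and the table are normalized to the same form first.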
        

--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/
