Re: VLERange: a range in between BidirectionalRange and RandomAccessRange

foobar Sat, 15 Jan 2011 08:00:42 -0800

Michel Fortin Wrote:

> On 2011-01-15 09:09:17 -0500, foobar <f...@bar.com> said:
> 
> > Lutger Blijdestijn Wrote:
> > 
> >> Michel Fortin wrote:
> >> 
> >>> On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn
> >>> <lutger.blijdest...@gmail.com> said:
> >> ...
> >>>> 
> >>>> Is it still possible to solve this problem or are we stuck with
> >>>> specialized string algorithms? Would it work if VleRange of string was a
> >>>> bidirectional range with string slices of graphemes as the ElementType
> >>>> and indexing with code units? Often used string algorithms could be
> >>>> specialized for performance, but if not, generic algorithms would still
> >>>> work.
> >>> 
> >>> I have my idea.
> >>> 
> >>> I think a good idea is to improve upon Andrei's first idea --
> >>> which was to treat char[], wchar[], and dchar[] all as ranges of dchar
> >>> elements -- by changing the element type to be the same as the string.
> >>> For instance, iterating on a char[] would give you slices of char[],
> >>> each having one grapheme.
> >>> 
> >> ...
> >> 
> >> Yes, this is exactly what I meant, but you are much clearer. I hope this
> >> can be made to work!
> >> 
> > 
> > My two cents are against this kind of design.
> > The "correct" approach IMO is a 'universal text' type which is a 
> > _container_ of said text. This type would provide ranges for the 
> > various abstraction levels. E.g.
> > text.codeUnits to iterate by codeUnits
> 
> Nothing prevents that in the design I proposed. Andrei's design already 
> implements "str".byDchar() that would work for code points. I'd suggest 
> changing the API to by!char(), by!wchar(), and by!dchar() for when you 
> deal with whatever kind of code unit or code point you want. This would 
> be mostly symmetric to what you can already do with foreach:
> 
>       foreach (char c; "hello") {}
>       foreach (wchar c; "hello") {}
>       foreach (dchar c; "hello") {}
> // same as:
>       foreach (c; "hello".by!char()) {}
>       foreach (c; "hello".by!wchar()) {}
>       foreach (c; "hello".by!dchar()) {}
> 
> 
> > Here's a (perhaps contrived) example:
> > Let's say I want to find the combining marks in some text.
> > 
> > For instance, Hebrew uses combining marks for vowels (among other 
> > things) and they are optional in the language (There's a "full" form 
> > with vowels and a "missing" form without them).
> > I have a Hebrew text in the "full" form and I want to strip it and 
> > convert it to the "missing" form.
> > 
> > How would I accomplish this with your design?
> 
> All you need is a range that takes a string as input and gives you code 
> points in a decomposed form (NFD), then you use std.algorithm.filter on 
> it:
> 
>       // original string
>       auto str = "...";
> 
>       // create normalized decomposed string as a lazy range of dchar (NFD)
>       auto decomposed = decompose(str);
> 
>       // filter to remove your favorite combining code point (use the hex code you want)
>       auto filtered = filter!"a != 0xFABA"(decomposed);
> 
>       // turn it back in composed form (NFC), optional
>       auto recomposed = compose(filtered);
> 
>       // convert back to a string (could also be wstring or dstring)
>       string result = array(recomposed.by!char());
> 
> This last line is the one doing everything. All the rest just chain 
> ranges together for doing on-the-fly decomposition, filtering, and 
> recomposition; the last line uses that chain of ranges to fill the array.
> 
> A more naive implementation not taking advantage of code points but 
> instead using a replacement table would also work:
> 
>       string str = "...";
>       string result;
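>       // change this table for what you want
>       string[string] replacements = ["é":"e"];
>       // under the proposed design, iterating a string yields grapheme slices
>       foreach (grapheme; str) {
>               // 'in' returns a pointer to the value, or null if the key is absent
>               auto replacement = grapheme in replacements;
>               if (replacement)
>                       result ~= *replacement;
>               else
>                       result ~= grapheme;
>       }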
>       
> 
> -- 
> Michel Fortin
> michel.for...@michelf.com
> http://michelf.com/
>


Ok, I guess I missed the "byDchar()" method. 
I envisioned the same algorithm looking like this:
 
// original string
string str = "...";

// create normalized decomposed string as a lazy range of dchar (NFD)
// Note: explicitly specify code points range:
auto decomposed = decompose(str.codePoints);

// filter to remove your favorite combining code point
auto filtered = filter!"a != 0xFABA"(decomposed);

// turn it back in composed form (NFC), optional
auto recomposed = compose(filtered);
 
// convert back to a string
// Note: a string type can be constructed from a range of code points
string result = string(recomposed);
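
For anyone who wants to try the pipeline above, here is a minimal sketch. It assumes a Phobos release whose std.uni provides normalize and combiningClass. Neither decompose() nor compose() from the snippets in this thread is a real function, so normalize!NFD and normalize!NFC stand in for them (they work eagerly on arrays, not lazily on ranges). The placeholder test against 0xFABA is swapped for a check on the canonical combining class, which drops every combining mark. The name stripCombiningMarks is made up for the example.

```d
import std.algorithm : filter;
import std.array : array;
import std.uni : NFC, NFD, combiningClass, normalize;
import std.utf : toUTF8;

/// Hypothetical helper: removes every code point whose canonical
/// combining class is non-zero (Hebrew vowel points, accents, ...).
string stripCombiningMarks(string text)
{
    // Decompose (NFD) so that each mark becomes a separate code point.
    string decomposed = normalize!NFD(text);

    // Iterating a string as a range yields dchar code points;
    // keep only those that are not combining marks.
    auto filtered = decomposed.filter!(c => combiningClass(c) == 0);

    // Collect the code points, re-encode as UTF-8, recompose (NFC).
    return normalize!NFC(toUTF8(filtered.array));
}
```

Usage would be something like stripCombiningMarks(fullText), where fullText holds the pointed Hebrew. Note that this strips all marks, not only vowels, so a real tool would probably filter on a narrower set of code points.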
 
The difference is that the string type is distinct from the intermediate code 
point ranges (this happens in your design too, albeit in a way that is less 
obvious to the user). String-specific code exists either way. Why not 
encapsulate it in a string type instead of forcing the user to use complex 
APIs with templates everywhere?
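
To make the idea concrete, here is a rough sketch of such a container. Everything about the Text type is hypothetical (its name, the three properties, and fromCodePoints); it only wraps helpers that I believe exist in later Phobos releases: std.utf.byCodeUnit, std.utf.byDchar and std.uni.byGrapheme. The UTF-8 storage is an arbitrary choice hidden behind the interface.

```d
import std.array : array;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar, toUTF8;

/// Hypothetical 'universal text' container: it owns the encoded data
/// and hands out one range per abstraction level.
struct Text
{
    private string data; // UTF-8 storage, an implementation detail

    this(string s) { data = s; }

    /// Builds a Text from any range of dchar code points.
    static Text fromCodePoints(R)(R points)
    {
        return Text(toUTF8(points.array));
    }

    auto codeUnits()  { return data.byCodeUnit; } // range of char
    auto codePoints() { return data.byDchar; }    // range of dchar
    auto graphemes()  { return data.byGrapheme; } // range of Grapheme
}
```

With this, the caller always states the level explicitly (text.codeUnits, text.codePoints or text.graphemes) and never has to know which encoding is used underneath.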
