Michel Fortin Wrote:

> On 2011-01-15 09:09:17 -0500, foobar <f...@bar.com> said:
>
> > Lutger Blijdestijn Wrote:
> >
> >> Michel Fortin wrote:
> >>
> >>> On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn
> >>> <lutger.blijdest...@gmail.com> said:
> >> ...
> >>>>
> >>>> Is it still possible to solve this problem or are we stuck with
> >>>> specialized string algorithms? Would it work if VleRange of string was a
> >>>> bidirectional range with string slices of graphemes as the ElementType
> >>>> and indexing with code units? Often used string algorithms could be
> >>>> specialized for performance, but if not, generic algorithms would still
> >>>> work.
> >>>
> >>> I have my idea.
> >>>
> >>> I think it'd be a good idea to improve upon Andrei's first idea --
> >>> which was to treat char[], wchar[], and dchar[] all as ranges of dchar
> >>> elements -- by changing the element type to be the same as the string.
> >>> For instance, iterating on a char[] would give you slices of char[],
> >>> each having one grapheme.
> >>>
> >> ...
> >>
> >> Yes, this is exactly what I meant, but you are much clearer. I hope this
> >> can be made to work!
> >>
> >
> > My two cents are against this kind of design.
> > The "correct" approach IMO is a 'universal text' type which is a
> > _container_ of said text. This type would provide ranges for the
> > various abstraction levels. E.g.
> > text.codeUnits to iterate by codeUnits
>
> Nothing prevents that in the design I proposed. Andrei's design already
> implements "str".byDchar() that would work for code points. I'd suggest
> changing the API to by!char(), by!wchar(), and by!dchar() for when you
> deal with whatever kind of code unit or code point you want.
> This would be mostly symmetric to what you can already do with foreach:
>
>     foreach (char c; "hello") {}
>     foreach (wchar c; "hello") {}
>     foreach (dchar c; "hello") {}
>     // same as:
>     foreach (c; "hello".by!char()) {}
>     foreach (c; "hello".by!wchar()) {}
>     foreach (c; "hello".by!dchar()) {}
>
> > Here's a (perhaps contrived) example:
> > Let's say I want to find the combining marks in some text.
> >
> > For instance, Hebrew uses combining marks for vowels (among other
> > things) and they are optional in the language (There's a "full" form
> > with vowels and a "missing" form without them).
> > I have a Hebrew text in the "full" form and I want to strip it and
> > convert it to the "missing" form.
> >
> > How would I accomplish this with your design?
>
> All you need is a range that takes a string as input and gives you code
> points in a decomposed form (NFD), then you use std.algorithm.filter on
> it:
>
>     // original string
>     auto str = "...";
>
>     // create normalized decomposed string as a lazy range of dchar (NFD)
>     auto decomposed = decompose(str);
>
>     // filter to remove your favorite combining code point
>     // (use the hex code you want)
>     auto filtered = filter!"a != 0xFABA"(decomposed);
>
>     // turn it back in composed form (NFC), optional
>     auto recomposed = compose(filtered);
>
>     // convert back to a string (could also be wstring or dstring)
>     string result = array(recomposed.by!char());
>
> This last line is the one doing everything. All the rest just chains
> ranges together for doing on-the-fly decomposition, filtering, and
> recomposition; the last line uses that chain of ranges to fill the array.
>
> A more naive implementation not taking advantage of code points but
> instead using a replacement table would also work:
>
>     string str = "...";
>     string result;
>     string[string] replacements = ["é":"e"]; // change this for what you want
>     foreach (grapheme; str) {
>         // `in` yields a pointer to the value, or null if the key is absent
>         auto replacement = grapheme in replacements;
>         if (replacement)
>             result ~= *replacement;
>         else
>             result ~= grapheme;
>     }
>
> --
> Michel Fortin
> michel.for...@michelf.com
> http://michelf.com/
Ok, I guess I missed the "byDchar()" method.
I envisioned the same algorithm looking like this:

    // original string
    string str = "...";

    // create normalized decomposed string as a lazy range of dchar (NFD)
    // Note: explicitly specify code points range:
    auto decomposed = decompose(str.codePoints);

    // filter to remove your favorite combining code point
    auto filtered = filter!"a != 0xFABA"(decomposed);

    // turn it back in composed form (NFC), optional
    auto recomposed = compose(filtered);

    // convert back to a string
    // Note: a string type can be constructed from a range of code points
    string result = string(recomposed);

The difference is that a string type is distinct from the intermediate code
point ranges (this happens in your design too, albeit in a less obvious way
to the user). There is string-specific code. Why not encapsulate it in a
string type instead of forcing the user to use complex APIs with templates
everywhere?
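
For comparison, here is a minimal sketch of how the "container type that hands out ranges" idea could be written against present-day Phobos. Several things in it are assumptions and not part of either proposal above: the names `Text`, `fromCodePoints` and `stripCombiningMarks` are hypothetical; `std.uni.normalize`, `std.uni.combiningClass`, `std.uni.byGrapheme`, `std.utf.byCodeUnit` and `std.utf.byDchar` are library functions that postdate this thread; `normalize` is eager, not the lazy `decompose`/`compose` ranges discussed above; and the filter drops every code point with a nonzero canonical combining class instead of one hard-coded code point, which is only an approximation of "combining mark".

```d
import std.algorithm.iteration : filter;
import std.array : array;
import std.conv : to;
import std.uni : byGrapheme, combiningClass, normalize, NFC, NFD;
import std.utf : byCodeUnit, byDchar;

/// Hypothetical "universal text" container (illustrative only).
/// It owns UTF-8 data and exposes one range per abstraction level.
struct Text
{
    string data;

    /// Build a Text from any input range of code points (dchar).
    static Text fromCodePoints(R)(R codePoints)
    {
        // Collect into dchar[], then transcode to UTF-8.
        return Text(codePoints.array.to!string);
    }

    auto codeUnits()  { return data.byCodeUnit; }  // range of char
    auto codePoints() { return data.byDchar; }     // range of dchar
    auto graphemes()  { return data.byGrapheme; }  // range of Grapheme

    string toString() const { return data; }
}

/// Convert "full" text to its "missing" form by removing combining marks.
Text stripCombiningMarks(Text full)
{
    // Decompose (NFD) so that marks become separate code points.
    auto decomposed = Text(normalize!NFD(full.toString()));

    // Assumption: nonzero canonical combining class means "combining mark".
    auto filtered = decomposed.codePoints.filter!(c => combiningClass(c) == 0);

    // Rebuild a Text from the code point range, then recompose (NFC).
    auto stripped = Text.fromCodePoints(filtered);
    return Text(normalize!NFC(stripped.toString()));
}

unittest
{
    // U+00E9 decomposes to 'e' + U+0301; the combining accent is dropped.
    assert(stripCombiningMarks(Text("\u00E9")).toString() == "e");
}
```

The string-specific steps (transcoding and normalization) live inside `Text` and `stripCombiningMarks`, while the intermediate ranges remain ordinary ranges that generic algorithms such as `filter` accept. Whether that is clearer than calling `by!dchar()` directly on a built-in string is exactly the design question being debated here.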