On 2011-01-15 09:09:17 -0500, foobar <f...@bar.com> said:
Lutger Blijdestijn Wrote:
Michel Fortin wrote:
On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn
<lutger.blijdest...@gmail.com> said:
...
Is it still possible to solve this problem or are we stuck with
specialized string algorithms? Would it work if VleRange of string was a
bidirectional range with string slices of graphemes as the ElementType
and indexing with code units? Often used string algorithms could be
specialized for performance, but if not, generic algorithms would still
work.
I have my idea.
I think it'd be a good idea to improve upon Andrei's first idea --
which was to treat char[], wchar[], and dchar[] all as ranges of dchar
elements -- by changing the element type to be the same as the string.
For instance, iterating on a char[] would give you slices of char[],
each having one grapheme.
...
Yes, this is exactly what I meant, but you are much clearer. I hope this can
be made to work!
My two cents are against this kind of design.
The "correct" approach IMO is a 'universal text' type which is a
_container_ of said text. This type would provide ranges for the
various abstraction levels. E.g.
text.codeUnits to iterate by codeUnits
Nothing prevents that in the design I proposed. Andrei's design already
implements "str".byDchar() that would work for code points. I'd suggest
changing the API to by!char(), by!wchar(), and by!dchar() for when you
deal with whatever kind of code unit or code point you want. This would
be mostly symmetric to what you can already do with foreach:
foreach (char c; "hello") {}
foreach (wchar c; "hello") {}
foreach (dchar c; "hello") {}
// same as:
foreach (c; "hello".by!char()) {}
foreach (c; "hello".by!wchar()) {}
foreach (c; "hello".by!dchar()) {}
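To make the symmetry concrete, here is a minimal sketch of what by!T could look like. The name `by` is hypothetical (it is only the API suggested above); the sketch assumes a Phobos version that provides std.utf.byChar, byWchar and byDchar and simply forwards to them.

```d
import std.range : walkLength;
import std.utf : byChar, byWchar, byDchar;

// Hypothetical helper mirroring the proposed by!T API.
// It only forwards to the existing std.utf adaptors.
auto by(C)(string s)
{
    static if (is(C == char))
        return s.byChar;       // UTF-8 code units
    else static if (is(C == wchar))
        return s.byWchar;      // UTF-16 code units
    else static if (is(C == dchar))
        return s.byDchar;      // code points
    else
        static assert(0, "by!T expects char, wchar or dchar");
}

unittest
{
    string s = "h\u00E9llo"; // U+00E9 takes two UTF-8 code units

    assert(s.by!char.walkLength == 6);
    assert(s.by!wchar.walkLength == 5);
    assert(s.by!dchar.walkLength == 5);
}
```

The unittest block can be run with `rdmd -unittest -main file.d`.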
Here's a (perhaps contrived) example:
Let's say I want to find the combining marks in some text.
For instance, Hebrew uses combining marks for vowels (among other
things) and they are optional in the language (There's a "full" form
with vowels and a "missing" form without them).
I have a Hebrew text in the "full" form and I want to strip it and
convert it to the "missing" form.
How would I accomplish this with your design?
All you need is a range that takes a string as input and gives you code
points in a decomposed form (NFD), then you use std.algorithm.filter on
it:
// original string
auto str = "...";
// create normalized decomposed string as a lazy range of dchar (NFD)
auto decomposed = decompose(str);
// filter to remove your favorite combining code point
// (use the hex code you want)
auto filtered = filter!"a != 0xFABA"(decomposed);
// turn it back in composed form (NFC), optional
auto recomposed = compose(filtered);
// convert back to a string (could also be wstring or dstring)
string result = array(recomposed.by!char());
This last line is the one doing everything. All the rest just chain
ranges together for doing on-the-fly decomposition, filtering, and
recomposition; the last line uses that chain of ranges to fill the array.
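For comparison, the same pipeline can be approximated with functions that exist in std.uni. This is only a sketch under some assumptions: std.uni.normalize is eager, not the lazy decompose()/compose() ranges described above, and instead of one hard-coded code point it drops every code point with a non-zero canonical combining class. That covers the Hebrew vowel points but is not a complete definition of "combining mark". The name stripCombiningMarks is made up for the example.

```d
import std.algorithm.iteration : filter;
import std.array : array;
import std.uni : NFC, NFD, combiningClass, normalize;
import std.utf : toUTF8;

// Hypothetical helper: decompose, drop combining code points, recompose.
string stripCombiningMarks(string str)
{
    // canonical decomposition (eager, returns a string)
    auto decomposed = normalize!NFD(str);

    // iterating a string with filter yields dchar code points;
    // keep only those with combining class 0 (assumption, see above)
    auto filtered = decomposed.filter!(c => combiningClass(c) == 0);

    // collect the code points and encode them back to UTF-8
    string stripped = toUTF8(filtered.array);

    // turn it back into composed form (optional)
    return normalize!NFC(stripped);
}

unittest
{
    // "shalom" with vowel points -> the same word without them
    assert(stripCombiningMarks("\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD")
        == "\u05E9\u05DC\u05D5\u05DD");
    // a precomposed accented letter loses its accent
    assert(stripCombiningMarks("caf\u00E9") == "cafe");
}
```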
A more naive implementation not taking advantage of code points but
instead using a replacement table would also work:
string str = "...";
string result;
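// change this for what you want
string[string] replacements = ["é":"e"];
// with the proposed design, each element is a one-grapheme slice of str
foreach (grapheme; str) {
    auto replacement = grapheme in replacements;
    if (replacement)
        result ~= *replacement; // `in` yields a pointer to the value
    else
        result ~= grapheme;
}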
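Since foreach over a string does not yield graphemes today, here is a sketch of the same loop that gets one-grapheme slices by hand. It assumes std.uni.graphemeStride is available; replaceGraphemes is a name invented for the example, and the table entries are only illustrative.

```d
import std.uni : graphemeStride;

// Hypothetical helper: walk str one grapheme slice at a time and
// substitute any grapheme found in the replacement table.
string replaceGraphemes(string str, string[string] replacements)
{
    string result;
    size_t i = 0;
    while (i < str.length)
    {
        // length in code units of the grapheme starting at index i
        immutable len = graphemeStride(str, i);
        string grapheme = str[i .. i + len];

        if (auto replacement = grapheme in replacements)
            result ~= *replacement;
        else
            result ~= grapheme;

        i += len;
    }
    return result;
}

unittest
{
    // cover both the precomposed and the decomposed spelling
    auto table = ["\u00E9": "e", "e\u0301": "e"];
    assert(replaceGraphemes("caf\u00E9", table) == "cafe");
    assert(replaceGraphemes("cafe\u0301", table) == "cafe");
}
```

Note that the table needs an entry for every spelling of a grapheme unless the input is normalized first, which is one reason the code-point pipeline above is less fragile.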
--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/