On Fri, 20 Aug 2010 02:22:56 +0000, dsimcha wrote:

> As I mentioned buried deep in another thread, std.string is in serious
> need of fixing, for two reasons:
>
> 1. Most of it doesn't work with UTF-16/UTF-32 strings.
>
> 2. Much of it requires the input to be immutable even when there's no
> good reason for this constraint.
>
> I'm trying to understand a few things before I dive into fixing it:
>
> 1. How did it get to be this way? Why did it seem like a good idea at
> the time to only support UTF-8 and only immutable strings?
>
> 2. Is there any "deep" design/technical issue that makes these hard to
> fix, or is it basically just lack of manpower and other priorities?
>
The problems are combinatorial because of the encoding schemes: every function multiplies out across char, wchar and dchar, and again across mutable and immutable input. I imagine that when someone wants a function that is missing from std.string, they write one themselves, and might even add it to the module.

I also found that std.utf did not contain exactly what I needed. The functions toUTF16 and toUTF8 have signatures like

    wstring toUTF16(const(dchar)[] s)

But when hacking a class I found I wanted functions that would have almost the very same innards, but could also append to mutable character arrays of any sort.

    // Does almost the same as toUTF16, but creates or appends to a mutable array.
    void append_UTF16m(ref wchar[] r, const(dchar)[] s) {...}

At the expense of another nested function call, which I imagine most people would not want to pay, toUTF16 becomes a call to append_UTF16m.

    wstring toUTF16(const(dchar)[] s)
    {
        wchar[] temp = null;
        append_UTF16m(temp, s);
        return assumeUnique(temp);
    }

isNumeric was not enough for me either. I was religiously trying to use ranges, so I needed a parsing function that would also tell me what sort of conversion function to call afterwards. I know it's really simple-minded, but it did the required job.

    enum NumberClass
    {
        NUM_ERROR = -1,
        NUM_EMPTY,
        NUM_INTEGER,
        NUM_REAL
    }

    /// R is an input range, P is an output range (put).
    /// Return a NumberClass value.
    /// Collect characters in P for later processing.
    /// Does no NAN or INF, only checks for error, empty, integer, or real.
    /// E or e might be an exponent, or just the end of a number.
    NumberClass getNumberString(R, P)(R ipt, P opt, int recurse = 0)
    {
        int digitct = 0;
        bool done = ipt.empty;
        bool decPoint = false;
        for (;;)
        {
            if (ipt.empty)
                break;
            auto test = ipt.front;
            ipt.popFront();
            switch (test)
            {
            case '-':
            case '+':
                // A sign is only accepted before the first digit.
                if (digitct > 0)
                {
                    done = true;
                }
                break;
            case '.':
                // A second decimal point ends the number.
                if (!decPoint)
                    decPoint = true;
                else
                    done = true;
                break;
            default:
                if (!isdigit(test))
                {
                    done = true;
                    if (test == 'e' || test == 'E')
                    {
                        // Ambiguous end of number, or exponent?
                        if (recurse == 0)
                        {
                            opt.put(test);
                            // The exponent itself must be an integer.
                            if (getNumberString(ipt, opt, recurse + 1) == NumberClass.NUM_INTEGER)
                                return NumberClass.NUM_REAL;
                            else
                                return NumberClass.NUM_ERROR;
                        }
                        // assume end of number
                    }
                }
                else
                    digitct++;
                break;
            }
            if (done)
                break;
            opt.put(test);
        }
        if (digitct == 0)
            return NumberClass.NUM_EMPTY;
        if (decPoint)
            return NumberClass.NUM_REAL;
        return NumberClass.NUM_INTEGER;
    }

I also wrote a string class:

http://dsource.org/projects/xmlp/trunk/alt/ustring.d

The component structures maintain a terminating null character and pretend it is not there. It seemed a good idea at the time, when I was doing a lot of Windows API calls which expected null-terminated C strings of char or wchar. The UString class converts on access to cstr(), wstr() or dstr(), on the assumption that the most recently used encoding will be the one requested most often, and ideally caches a decent hash value. I only have some limited uses of UString so far, because character arrays are so powerful.

    struct cstext { char[] str_ = null; ... }
    struct wstext { wchar[] str_ = null; ... }
    struct dstext { dchar[] str_ = null; ... }

    class UString
    {
        private
        {
            union
            {
                vstruc vstr; // not fully supported?
                cstext cstr;
                wstext wstr;
                dstext dstr;
            }
            UStringType ztype;
            hash_t hash_;
        }
        ...
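
Going back to the append_UTF16m idea earlier in this post, here is a minimal sketch of how the append-then-wrap split could look. The body of append_UTF16m is elided in the post, so the one shown here is only an assumption: it leans on std.utf.encode, which appends the encoding of one code point to a wchar[]. The wrapper name toUTF16ViaAppend is hypothetical, chosen so that it does not clash with std.utf.toUTF16.

```d
import std.exception : assumeUnique;
import std.utf : encode;

// Assumed body (the post elides it): append the UTF-16 encoding of s to r.
void append_UTF16m(ref wchar[] r, const(dchar)[] s)
{
    foreach (dchar c; s)
        encode(r, c); // std.utf.encode appends one code point to the wchar[]
}

// Hypothetical wrapper name; mirrors the toUTF16 shown in the post.
wstring toUTF16ViaAppend(const(dchar)[] s)
{
    wchar[] temp = null;
    append_UTF16m(temp, s);
    // temp is never referenced again, so treating it as immutable is sound.
    return assumeUnique(temp);
}
```

With this split, a caller that already holds a wchar[] buffer can keep appending to it, and only the immutable-returning wrapper pays for the extra call.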