On 06/19/2016 11:44 PM, Walter Bright via Digitalmars-d wrote:
> On 6/19/2016 11:36 PM, Charles Hixson via Digitalmars-d wrote:
>> To me it seems that a lot of the time processing is more efficient with
>> UCS-4 (what I call utf-32). Storage is clearly more efficient with utf-8,
>> but access is more direct with UCS-4. I agree that utf-8 is generally to
>> be preferred where it can be efficiently used, but that's not everywhere.
>> The problem is efficient bi-directional conversion...which D appears to
>> handle fairly well already with text() and dtext(). (I don't see any
>> utility for utf-16. To me that seems like a first attempt that should
>> have been deprecated.)
> That seemed to me to be true, too, until I wrote a text processing
> program using UCS-4. It was rather slow. Turns out, 4x memory
> consumption has a huge performance cost.

The approach I took (which worked well for my purposes) was to process
the text a line at a time, and for that the overhead of memory was
trivial. ... If I'd needed to go back and forth this wouldn't have been
desirable, but there was one dtext conversion, processing, and then
several text conversions (of small portions), and it was quite
efficient. Clearly this can't be the approach taken in all
circumstances, but for this purpose it was significantly more efficient
than any other approach I've tried. It's also true that most of the text
I handled was actually ASCII, which would have made the most common
conversion processes simpler.
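
A minimal sketch of that line-at-a-time pattern, not the actual program: one dtext() conversion per line, direct indexing on the resulting dstring, then text() on a small slice. The helper name firstCodePoints and the sample input are made up for illustration.

```d
import std.conv : dtext, text;
import std.stdio : writeln;
import std.string : lineSplitter;

// Hypothetical helper: the first `n` code points of `line`, returned as UTF-8.
string firstCodePoints(string line, size_t n)
{
    // One UTF-8 -> UTF-32 conversion per line, so the extra memory
    // is bounded by the length of the longest line.
    dstring wide = dtext(line);
    if (n > wide.length)
        n = wide.length;
    // Slicing a dstring is by code point; only the small slice
    // is converted back to UTF-8.
    return text(wide[0 .. n]);
}

void main()
{
    // Illustrative input; a real program would read lines from a file.
    string input = "naïve café\nplain ascii\n日本語のテキスト";
    foreach (line; input.lineSplitter)
        writeln(firstCodePoints(line, 3));
}
```

Because only one line is held as dchars at a time, the 4x cost applies to a line rather than to the whole text.
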
To me it appears that both cases need to be handled. The problem is
documenting the tradeoffs in efficiency. D seems to already work quite
well with arrays of dchars, so there may well not be any need for
development in that area. Direct indexing of utf-8 arrays by code point,
however, is much more complicated, since each code point takes one to four
bytes, and I doubt it can ever be as efficient.
Memory allocation, however, is a separate, though not independent,
complexity. If you can work in small chunks then it becomes less important.
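
To make the indexing tradeoff concrete, here is a small sketch under my own assumptions (byteOffsetOf is a made-up helper, not a Phobos function). Indexing a dstring yields a code point directly, while finding the n-th code point in a utf-8 string means walking it from the start.

```d
import std.conv : dtext;
import std.range : walkLength;
import std.utf : stride;

// Hypothetical helper: byte offset of the n-th code point in a UTF-8 string.
// It walks from the start, so the cost grows with n.
size_t byteOffsetOf(string s, size_t n)
{
    size_t i = 0;
    foreach (_; 0 .. n)
    {
        if (i >= s.length)
            break;
        i += stride(s, i); // code units in the sequence starting at i
    }
    return i;
}

void main()
{
    string u8 = "aé日𝄞";     // 1-, 2-, 3- and 4-byte sequences
    dstring u32 = dtext(u8);

    assert(u8.length == 10);              // UTF-8 code units (bytes)
    assert(u32.length == 4);              // one dchar per code point
    assert(u8.walkLength == 4);           // counting code points decodes the string
    assert(u32[2] == '日');               // direct index into the dchar array
    assert(byteOffsetOf(u8, 2) == 3);     // linear scan to reach the same character
    assert(u32.length * dchar.sizeof == 16); // 16 bytes of storage versus 10
}
```

For mostly-ASCII text the dstring is close to four times the size of the utf-8 original, which is where working in small chunks helps.
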