On 06/19/2016 11:44 PM, Walter Bright via Digitalmars-d wrote:
On 6/19/2016 11:36 PM, Charles Hixson via Digitalmars-d wrote:
To me it seems that a lot of the time processing is more efficient with UCS-4 (what I call utf-32). Storage is clearly more efficient with utf-8, but access is more direct with UCS-4. I agree that utf-8 is generally to be preferred where it can be efficiently used, but that's not everywhere. The problem is efficient bi-directional conversion...which D appears to handle fairly well already with text() and dtext(). (I don't see any utility for utf-16. To me that seems like a first attempt that should have been deprecated.)

That seemed to me to be true, too, until I wrote a text processing program using UCS-4. It was rather slow. Turns out, 4x memory consumption has a huge performance cost.
The approach I took (which worked well for my purposes) was to process the text a line at a time, and for that the overhead of memory was trivial. ... If I'd needed to go back and forth this wouldn't have been desirable, but there was one dtext conversion, processing, and then several text conversions (of small portions), and it was quite efficient. Clearly this can't be the approach taken in all circumstances, but for this purpose it was significantly more efficient than any other approach I've tried. It's also true that most of the text I handled was actually ASCII, which would have made the most common conversion processes simpler.
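
For illustration, the line-at-a-time pattern described above might look roughly like the following. This is only a sketch, not the actual program: the function name reverseEachLine and the choice of "reverse each line" as the processing step are made up for the example. It assumes std.conv.dtext and std.conv.text for the two conversions and std.string.lineSplitter for walking the input, so only one line is ever held as dchars.

```d
import std.conv : dtext, text;
import std.string : lineSplitter;

// Hypothetical helper (not from the original post): reverse the code
// points of every line, holding only one line at a time as UTF-32.
string reverseEachLine(string input)
{
    string result;
    foreach (line; input.lineSplitter)
    {
        // One dtext conversion per line; .dup gives a mutable dchar[].
        dchar[] buf = line.dtext.dup;

        // Direct indexing is safe here: each element is a whole code point.
        for (size_t i = 0, j = buf.length; i + 1 < j; ++i, --j)
        {
            immutable tmp = buf[i];
            buf[i] = buf[j - 1];
            buf[j - 1] = tmp;
        }

        // Convert the small processed piece back to UTF-8.
        result ~= buf.text;
        result ~= '\n';
    }
    return result;
}

unittest
{
    assert(reverseEachLine("héllo\nwörld") == "olléh\ndlröw\n");
}
```

The 4x expansion then applies only to the current line rather than to the whole text. Note that the reversal works per code point, so combining sequences are not kept together.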

To me it appears that both cases need to be handled. The problem is documenting the tradeoffs in efficiency. D seems to already work quite well with arrays of dchars, so there may well not be any need for development in that area. Direct indexing of utf-8 arrays, however, is a much more complicated thing, which I doubt can ever be as efficient. Memory allocation, however, is a separate, though not independent, complexity. If you can work in small chunks then it becomes less important.
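
To make the indexing tradeoff concrete, here is a minimal sketch. The helper name codePointAt is hypothetical; it assumes std.utf.stride and std.utf.decode to walk a utf-8 string one code point at a time. Finding the n-th code point in utf-8 means stepping over everything before it, while a dchar array is indexed directly.

```d
import std.conv : dtext;
import std.utf : decode, stride;

// Hypothetical helper: return the n-th code point of a UTF-8 string.
// This is O(n), because every preceding sequence has to be skipped.
// It assumes n is less than the number of code points in s.
dchar codePointAt(string s, size_t n)
{
    size_t i = 0;
    foreach (_; 0 .. n)
        i += stride(s, i);   // length in code units of the sequence at i
    return decode(s, i);     // decode the code point starting at i
}

unittest
{
    string s = "aé→z";
    dstring d = s.dtext;

    assert(s.length == 7);              // code units: 1 + 2 + 3 + 1
    assert(d.length == 4);              // one element per code point
    assert(codePointAt(s, 2) == '→');   // linear walk over the utf-8 data
    assert(d[2] == '→');                // constant-time index
}
```

For sequential scans the utf-8 walk costs little, and for mostly-ASCII text it is cheaper still; the difference only matters when you need to jump around by code point index.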
