Re: UTF-8 Everywhere

Charles Hixson via Digitalmars-d Mon, 20 Jun 2016 11:36:22 -0700

On 06/19/2016 11:44 PM, Walter Bright via Digitalmars-d wrote:

On 6/19/2016 11:36 PM, Charles Hixson via Digitalmars-d wrote:
To me it seems that a lot of the time processing is more efficient with UCS-4 (what I call utf-32). Storage is clearly more efficient with utf-8, but access is more direct with UCS-4. I agree that utf-8 is generally to be preferred where it can be efficiently used, but that's not everywhere. The problem is efficient bi-directional conversion...which D appears to handle fairly well already with text() and dtext(). (I don't see any utility for utf-16. To me that seems like a first attempt that should have been deprecated.)
That seemed to me to be true, too, until I wrote a text processing program using UCS-4. It was rather slow. Turns out, 4x memory consumption has a huge performance cost.

The approach I took (which worked well for my purposes) was to process the text a line at a time, and for that the overhead of memory was trivial. ... If I'd needed to go back and forth this wouldn't have been desirable, but there was one dtext conversion, processing, and then several text conversions (of small portions), and it was quite efficient. Clearly this can't be the approach taken in all circumstances, but for this purpose it was significantly more efficient than any other approach I've tried. It's also true that most of the text I handled was actually ASCII, which would have made the most common conversion processes simpler.
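
For illustration only, the line-at-a-time pattern might look roughly like the sketch below. It is not the code from the program described above: `tail` is a hypothetical helper invented for this example, and it assumes each line is already available as a D `string`. The only library calls are `dtext` and `text` from `std.conv`.

```d
import std.conv : dtext, text;
import std.stdio : writeln;

/// Hypothetical helper (an assumption for this sketch): returns the
/// last `n` code points of `line` as a UTF-8 string.
string tail(string line, size_t n)
{
    dstring wide = dtext(line);        // one UTF-8 -> UTF-32 conversion per line
    if (n > wide.length)
        n = wide.length;               // clamp so the slice stays in bounds
    // Each element of `wide` is one code point, so slicing never
    // cuts through a multi-byte sequence.
    return text(wide[$ - n .. $]);     // convert only the small slice back to UTF-8
}

void main()
{
    // Only one line is held in UTF-32 at a time, so the 4x memory
    // cost applies to the current line rather than the whole text.
    foreach (line; ["plain ASCII", "naïve café"])
        writeln(tail(line, 4));
}
```

The UTF-32 copy lives only for the duration of one call, which is why the extra memory stays small when the input is handled one line at a time.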

To me it appears that both cases need to be handled. The problem is documenting the tradeoffs in efficiency. D seems to already work quite well with arrays of dchars, so there may well not be any need for development in that area. Direct indexing of utf-8 arrays, however, is a much more complicated thing, which I doubt can ever be as efficient. Memory allocation, however, is a separate, though not independent, complexity. If you can work in small chunks then it becomes less important.
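
A small sketch of the indexing tradeoff, offered as an illustration rather than a benchmark. `nthCodePoint` is a hypothetical helper written for this example; it relies on `stride` and `decode` from `std.utf`, and it assumes the input is valid UTF-8 with at least n + 1 code points.

```d
import std.utf : decode, stride;

/// Hypothetical helper (an assumption for this sketch): the n-th code
/// point of a UTF-8 string, found by scanning from the start.
dchar nthCodePoint(string s, size_t n)
{
    size_t i = 0;                  // byte offset into s
    foreach (_; 0 .. n)
        i += stride(s, i);         // skip one whole code point (1 to 4 bytes)
    return decode(s, i);           // decode the code point starting at byte i
}

void main()
{
    string  narrow = "naïve café";     // UTF-8 code units
    dstring wide   = "naïve café"d;    // UTF-32, one element per code point

    assert(narrow.length == 12);       // 10 code points, two of them 2 bytes each
    assert(wide.length == 10);
    assert(wide.length * dchar.sizeof == 40);  // storage in bytes for the UTF-32 copy

    assert(wide[9] == 'é');                  // direct index, constant time
    assert(narrow[9] == 'f');                // indexing UTF-8 yields a code unit
    assert(nthCodePoint(narrow, 9) == 'é');  // linear scan to reach the same code point
}
```

The dchar array pays for its constant-time index with more storage (40 bytes against 12 here), while the UTF-8 string has to be walked from the front to reach a given code point.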
