Re: [Development] Why can't QString use UTF-8 internally?

Rutledge Shawn Thu, 12 Feb 2015 01:12:53 -0800

On 12 Feb 2015, at 08:55, Konstantin Ritt <ritt...@gmail.com> wrote:

> 2015-02-12 11:53 GMT+04:00 Konstantin Ritt <ritt...@gmail.com>:
> 2015-02-12 11:39 GMT+04:00 Rutledge Shawn <shawn.rutle...@theqtcompany.com>:
> 
> On 11 Feb 2015, at 18:15, Konstantin Ritt <ritt...@gmail.com> wrote:
> 
> > FYI: Unicode codepoint != character visual representation. Moreover, a 
> > single character could be represented with a sequence of glyphs or vice 
> > versa - a sequence of characters could be represented with a single glyph.
> > QString (and every other Unicode string class in the world) represents a 
> > sequence of Unicode codepoints (in this or that UTF), not characters or 
> > glyphs - always remember that!
> 
> Is it impossible to convert some of the possible multi-codepoint sequences 
> into single ones, or is it just that we prefer to preserve them so that when 
> you convert back to UTF you get the same bytes with which you created the 
> QString?
> 
> Not sure I understand your question in context of visual representation.
> Assume you're talking about composing the input string (though the same 
> string, composed and decomposed, would be shaped into the same sequence of 
> glyphs).
> A while ago we decided to not change the composition form of the input text 
> and let the user (de)compose where they need a fixed composition form, so 
> that QString(wellformed_unicode_text).toUnicode() == wellformed_unicode_text.
> 
> P.S. We could re-consider this or could introduce a macro that would change 
> the composition form of a QString input but…why?

It might be almost within our power to index into a QString and get a single, 
complete, renderable glyph, which is practical at least for rendering, and 
maybe for editing too.  But if we did that by storing the unicode that way, 
we’d lose this feature of being able to reproduce the input text exactly:
QString(wellformed_unicode_text).toUnicode() == wellformed_unicode_text
Consequently we have to do conversion each time we need the renderable text, 
and/or cache the results to avoid converting repeatedly.  Right?
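To make the trade-off concrete, here is a minimal sketch in plain C++ (no Qt, so that it stands alone). The helper name countCodePoints and the two sample strings are made up for illustration, and the helper assumes well-formed UTF-8. It shows the same visible character stored as one codepoint or as two, so the two inputs differ byte for byte, and normalizing on input would lose that difference:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count Unicode codepoints in a UTF-8 string by skipping continuation
// bytes (those of the form 10xxxxxx). Assumes well-formed input.
std::size_t countCodePoints(const std::string &utf8)
{
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

// "é" as the single precomposed codepoint U+00E9 ...
const std::string composed = "\xC3\xA9";
// ... and as "e" (U+0065) followed by the combining acute accent U+0301.
const std::string decomposed = "e\xCC\x81";

// Hypothetical self-check: same rendered character, different stored data.
void checkComposedVsDecomposed()
{
    assert(countCodePoints(composed) == 1);
    assert(countCodePoints(decomposed) == 2);
    assert(composed != decomposed);
}
```

If the string class silently composed its input, the second string would come back as the first, and the equality above could no longer be used to tell them apart.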

But it would also be possible to go the other direction: save only UTF-8 form 
in memory, so by definition we can reproduce text that was given in UTF-8 form. 
But if the QString was constructed from text in some other UTF form, can we 
simply remember which UTF it was, convert to UTF-8, and then be able to 
reproduce it exactly by converting back from UTF-8?  And we still need to be 
able to do conversion to renderable glyphs, and maybe cache them.
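As a rough sketch of that round trip, again in plain C++ rather than Qt: utf16ToUtf8 and utf8ToUtf16 below are hypothetical helpers written for this mail, not Qt API, and they assume well-formed input (correctly paired surrogates, valid UTF-8). Under that assumption the conversion is lossless in both directions, so remembering the original encoding (plus byte order, for the 16- and 32-bit forms) should be enough. Ill-formed input, such as a lone surrogate, would not survive.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encode well-formed UTF-16 as UTF-8.
std::string utf16ToUtf8(const std::u16string &in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            // Combine a high/low surrogate pair into one codepoint.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[++i] - 0xDC00);
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Decode well-formed UTF-8 back to UTF-16.
std::u16string utf8ToUtf16(const std::string &in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i++]);
        char32_t cp;
        int extra; // number of continuation bytes that follow the lead byte
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        for (; extra > 0 && i < in.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            // Split codepoints above the BMP into a surrogate pair.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}

// Hypothetical self-check: UTF-16 -> UTF-8 -> UTF-16 reproduces the input.
void checkRoundTrip()
{
    // 'a', U+00E9, U+20AC, and U+1F600 as a surrogate pair.
    const std::u16string original = {u'a', 0x00E9, 0x20AC, 0xD83D, 0xDE00};
    assert(utf8ToUtf16(utf16ToUtf8(original)) == original);
}
```

Note how the decoder has to branch on every lead byte, which is the "branchy" cost of iterating over UTF-8 directly instead of over an array of 16-bit units.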

I see that QTextBoundaryFinder has
     const QChar *chars;
so that is nearly a cache of glyphs, right?  Except that e.g. a soft hyphen 
could exist in that list, and yet may or may not be rendered depending where 
the line breaks fall?  So at the end, rendering must still be done iteratively 
by calling toNextBoundary repeatedly and pulling out substrings and rendering 
those.  (QTextBoundaryFinder doesn’t have a QChar grapheme() accessor.  I guess 
that’s done elsewhere.)  But some of the decisions have been made already, and 
embodied into that array of QChars.  I was wondering whether it’s worthwhile to 
do more work each time we iterate, by using UTF-8 form directly, instead of 
converting to an array of QChars first.  So the memory to store the string 
would be less, but the code to do the glyph-by-glyph iteration at rendering 
time would become more “branchy”, and that is also bad for CPU cache 
performance.

Oh but there’s another way of storing glyphs: the list of QScriptItems in the 
text engine.  That looks kind of bulky too, depending how long we keep it around.

So Unicode is a mini-language which has to be interpreted at some point on the 
way to rendering; there’s no pre-interpreted form we could store it in.  
TrueType is also a mini-language.  Maybe it would be possible to write a 
compiler which reads UTF-8 and TrueType and writes (nearly) branch-free code to 
render a whole line or block of text, so we could cache code instead of data.  
It could be more compact and CPU cache-friendly.  I imagine nobody has done 
that yet.  But then if you think about all the fancy stuff TeX can do, it could 
get even more complex than what Qt currently does.  And I don’t understand much 
about what Harfbuzz does yet, either.

_______________________________________________
Development mailing list
Development@qt-project.org
http://lists.qt-project.org/mailman/listinfo/development
