Freddie Unpenstein wrote:

Okay, to clarify one point.  I was speaking more of NULL handling in general 
within GLib/GTK, rather than ONLY the one function named in the subject line.  
This thread of the discussion should have been fork()ed a few messages before I 
jumped in, I suppose.


From: "Behdad Esfahbod" , 12/10/2008 07:22

>> I believe, that differs from the UTF-8 specification ONLY in the
>> handling of the NULL byte, but then I've been avoiding dealing with
>> UTF-8 for the most part for exactly this reason. When UTF-8 is a strict
>> issue, I've been using higher-level scripted languages instead, that
>> already deal with it natively. (And I'm not 100% certain, but I think
>> that's essentially what they all do.)
> False. XML doesn't do such nasty stuff. HTTP doesn't either. HTML doesn't
> either. *Only* Java does. There's a reason standards are good, and there's
> a reason people use standards.

XML isn't processing the text, iterating over the text, etc.  Neither is 
HTTP.  It is appropriate for them to employ the UTF-8 standard, in all its 
absolutely rock-solid glory.  But it isn't XML or HTTP I'm talking about; it's 
writing an application in the C programming language that may well be 
processing such data.  I'm not sure what your point was there; it seems 
off-topic to me.

You're wrong on the ONLY part, also.  Java isn't the *Only* higher-level 
language around.  And regardless, how an HLL wishes to store its strings is of 
no concern, as long as it does the right thing at the borders.  The same goes 
for GLib/GTK.  And THAT is the point of this portion of my argument.  It's 
perfectly valid to have a not-quite-UTF-8 internal string format, as long as 
it's kept internal.  If you're producing output (not only XML or HTML), then 
by all means do the _to_utf8 conversion.  It should be REQUIRED anyhow, in 
this day and age!
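
To make that concrete, here's a minimal sketch of what I mean by a border 
conversion.  It assumes the internal format is the Java-style "modified 
UTF-8", where NUL is stored as the two-byte pair 0xC0 0x80; the function name 
internal_to_utf8 is my own invention, not GLib API:

#include <glib.h>
#include <string.h>

/* Hypothetical border conversion: rewrite the internal two-byte NUL
 * (0xC0 0x80) back into a real 0x00 byte on the way out.  The result
 * may contain embedded NULs, so it is returned with an explicit
 * length rather than relying on NUL termination. */
static guchar *
internal_to_utf8 (const gchar *internal, gsize *out_len)
{
    gsize   len = strlen (internal);
    guchar *out = g_malloc (len + 1);   /* output never grows */
    gsize   j = 0;

    for (gsize i = 0; i < len; i++) {
        if ((guchar) internal[i]     == 0xC0 &&
            (guchar) internal[i + 1] == 0x80) {
            out[j++] = 0x00;            /* one pair becomes one byte */
            i++;                        /* skip the 0x80 */
        } else {
            out[j++] = (guchar) internal[i];
        }
    }
    out[j] = '\0';
    *out_len = j;
    return out;
}

Everything inside the program keeps one simple NUL-terminated representation; 
only the code at the edge ever deals with the real thing.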

You are absolutely right in one regard, though: standards ARE good.  They also 
have their place; they're designed for a purpose, and they're not perfect.  
External data SHOULD be standards compliant.  Due care IS required, a 
specification IS needed, and that specification MUST be upheld, for things to 
work.  But, internal data does NOT have to be held to the same rigours as 
external data, and in many cases it is wrong to impose a perfectly correct 
external data standard on the internal representation of that data; it can 
quite readily be half a step to the left of the external standard, if doing so 
makes working with that standard easier and less confusing, especially for the 
less capable.  Just look at network byte ordering.  It is a standard, but that 
doesn't mean every program uses that byte ordering throughout internally.  
Well-written programs convert it at the borders, or at the least, guard it 
closely until they do.
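
That convert-at-the-borders pattern looks something like this in C; the 
struct and its field names are made up purely for illustration:

#include <stdint.h>
#include <arpa/inet.h>   /* htonl() / ntohl() */

/* A hypothetical wire-format message.  Conversion happens exactly
 * once, at the border; everything past this point works in native
 * host byte order. */
struct msg {
    uint32_t length;
    uint32_t sequence;
};

static void
msg_from_wire (struct msg *m)       /* entry: network -> host */
{
    m->length   = ntohl (m->length);
    m->sequence = ntohl (m->sequence);
}

static void
msg_to_wire (struct msg *m)         /* exit: host -> network */
{
    m->length   = htonl (m->length);
    m->sequence = htonl (m->sequence);
}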

Which is better for the application: RIGHT code that uses a slightly non-UTF-8 
internal form, or WRONG code that tries to do the right thing but fails under 
the added complexity and/or unexpected gotchas?


>> A "convert to UTF-8" function given a UTF-8 input with a 6-byte
>> representation of the character 'A' would store the regular single-byte
>> representation.
> False. It errs on an overlong representation of 'A'. If it doesn't, it's a 
> bug.

Well, for one thing I was obviously speaking there of conversion rather than 
validation.  For conversion, you MIGHT want to be strict.  MOST, however, will 
want to be as tolerant as they can of almost-right data.  Better to make the 
conversion function flexible, and if you're worried, validate it prior to 
conversion with something like g_utf8_validate().  An over-long representation 
of 'A' (I could imagine a sloppy UTF-16 program letting something like that 
through) sitting in a data file shouldn't break your program, unless there's an 
external contract stating that it should.  If there is, g_utf8_validate() will 
uphold that contract just fine prior to conversion.
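
For the strict case, that validation step is one call.  The wrapper 
read_strict_utf8() below is hypothetical, but g_utf8_validate() and the 
G_CONVERT_ERROR codes are real GLib:

#include <glib.h>

/* Enforce an external strict-UTF-8 contract at the border.  On
 * success, returns a copy of the buffer; on failure, reports where
 * the first invalid sequence begins. */
static gchar *
read_strict_utf8 (const gchar *buf, gsize len, GError **error)
{
    const gchar *bad;

    if (!g_utf8_validate (buf, (gssize) len, &bad)) {
        g_set_error (error, G_CONVERT_ERROR,
                     G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
                     "invalid UTF-8 at byte offset %ld",
                     (long) (bad - buf));
        return NULL;
    }
    return g_strndup (buf, len);
}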


> You're totally missing the point. Allowing an alternate nul representation
> opens security problems much harder to track down than a crashing application.
> There's a lot of literature to read about this already. Not going to
> continue it here.

I've read a fair bit of such literature myself.  Probably not quite so much as 
you, so I'm more than happy to be enlightened.  But real \0 NULLs also 
introduce security problems that don't always crash a program.  If one small 
function in a program decides to treat a NULL as a special character, even by 
mistake, then you've got a hard-to-find problem.  Allowing alternate 
representations is the problem; demanding exactly one specific alternate 
representation is not.

I'll even repeat that part a little differently: the critical point that every 
single piece of literature I've read makes is to have ONE definition of each 
character on the inside, and to properly control the entry and exit of that 
data.  It also does NOT try to dictate WHAT that definition should be, only 
that it SHOULD avoid being something that has special magical meanings in 
unexpected cases.  NULL qualifies as magical.
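
An entry-side version of that rule might look like the sketch below.  It 
assumes the internal rule is the same two-byte NUL as above, and 
nul_canonicalize() is again a made-up name:

#include <glib.h>

/* Hypothetical entry-point canonicalization: a length-counted
 * external buffer may contain real 0x00 bytes; store each one as the
 * single agreed-upon internal pair 0xC0 0x80, so the interior never
 * meets a magical terminator in the middle of a string. */
static gchar *
nul_canonicalize (const guchar *data, gsize len)
{
    GString *s = g_string_sized_new (len);

    for (gsize i = 0; i < len; i++) {
        if (data[i] == 0x00)
            g_string_append_len (s, "\xC0\x80", 2);
        else
            g_string_append_c (s, (gchar) data[i]);
    }

    /* FALSE: keep the character data, free only the GString wrapper */
    return g_string_free (s, FALSE);
}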

If my interpretation is wrong, please DO continue, because I've heard a few 
others who appear to have the same interpretation of that literature as I do, 
if not the same conclusion about how it should be handled.


Fredderic

