Freddie Unpenstein wrote:
Okay, to clarify one point. I was speaking more of NULL handling in general
within Glib/GTK, rather than ONLY the one function named in the subject line.
This thread of the discussion should have been fork()ed a few messages before I
jumped in, I suppose.
From: "Behdad Esfahbod", 12/10/2008 07:22
>> I believe, that differs from the UTF-8 specification ONLY in the
>> handling of the NULL byte, but then I've been avoiding dealing with
>> UTF-8 for the most part for exactly this reason. When UTF-8 is a strict
>> issue, I've been using higher-level scripted languages instead, that
>> already deal with it natively. (And I'm not 100% certain, but I think
>> that's essentially what they all do.)
> False. XML doesn't do such nasty stuff. HTTP doesn't either. HTML doesn't
> either. *Only* Java does. There's a reason standards are good, and there's
> a reason people use standards.
XML isn't processing the text, iterating over the text, etc. Neither is
HTTP. It is appropriate for them to employ the UTF-8 standard, in all its
absolutely rock-solid glory. But this isn't XML or HTTP I'm talking about; this
is writing an application in the C programming language that may well be
processing such. I'm not sure what your point was there; it seems off-topic to
me.
You're wrong on the ONLY part, also. Java isn't the *Only* higher-level
language around. And regardless, how an HLL wishes to store its strings is of
no concern, as long as it does the right thing at the borders. The same goes
for Glib/GTK. And THAT is the point of this portion of my argument. It's
perfectly valid to have a not-quite-UTF-8 internal string format, as long as
it's kept internal. If you're producing output (not only XML or HTML), then
by all means do the _to_utf8 conversion. It should be REQUIRED anyhow, in this
day and age!
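To make the "not-quite-UTF-8 internal format" concrete, here is a minimal sketch of the Java-style "modified UTF-8" trick being discussed: the NUL byte is encoded as the overlong two-byte sequence 0xC0 0x80, so the internal buffer never contains an embedded '\0' that C string functions would trip over. The function name is hypothetical, and the sketch only handles bytes below 0x80 to keep it short:

```c
#include <stddef.h>

/* Hypothetical helper: rewrite a byte buffer into Java-style "modified
 * UTF-8", where NUL (0x00) becomes the overlong pair 0xC0 0x80 so the
 * result carries no embedded '\0'.  Bytes >= 0x80 are passed through
 * untouched in this sketch.  Returns the number of bytes written. */
static size_t to_modified_utf8(const unsigned char *in, size_t len,
                               unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < len; i++) {
        if (in[i] == 0x00) {
            out[o++] = 0xC0;   /* overlong lead byte      */
            out[o++] = 0x80;   /* continuation byte       */
        } else {
            out[o++] = in[i];  /* everything else: as-is  */
        }
    }
    return o;
}
```

So the three-byte input "A\0B" becomes the four bytes 'A', 0xC0, 0x80, 'B' internally, and a strict _to_utf8 pass at the border would turn 0xC0 0x80 back into a real NUL (or reject it, per the external contract).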
You are absolutely right in one regard, though: standards ARE good. They also
have their place, they're designed for a purpose, and they're not perfect.
External data SHOULD be standards compliant. Due care IS required, a
specification IS needed, and that specification MUST be upheld, for things to
work. But internal data does NOT have to be held to the same rigours as
external data, and in many cases it is wrong to impose a perfectly correct
external data standard on the internal representation of that data; it can
quite readily be half a step to the left of the external standard, if doing so
makes working with that standard easier and less confusing, especially for the
less capable. Just look at network byte ordering. It is a standard; that
doesn't mean that every program uses that byte ordering throughout internally.
Well-written programs convert it at the borders, or at the least, guard it
closely until they do.
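The network-byte-ordering analogy is easy to sketch. On POSIX systems, htonl() and ntohl() are exactly this kind of border conversion: the wire format is big-endian by standard, but the program works in host order internally and converts only where data crosses the boundary (the helper names below are illustrative, not any real API):

```c
#include <arpa/inet.h>  /* htonl / ntohl (POSIX) */
#include <stdint.h>

/* Illustrative "convert at the borders" pattern: the standard governs
 * the external (wire) representation only; internally the program uses
 * whatever the host finds natural. */
static uint32_t read_from_wire(uint32_t wire_value)
{
    return ntohl(wire_value);   /* border crossing: wire -> internal */
}

static uint32_t write_to_wire(uint32_t host_value)
{
    return htonl(host_value);   /* border crossing: internal -> wire */
}
```

Everything between those two calls is free to ignore the standard's byte order entirely, which is the same liberty being claimed here for an internal string encoding.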
Which is better for the application: RIGHT code that uses a slightly non-UTF-8
internal form, or WRONG code that tries to do the right thing but fails due to
the added complexity and/or unexpected gotchas?
>> A "convert to UTF-8" function given a UTF-8 input with a 6-byte
>> representation of the character 'A' would store the regular single-byte
>> representation.
> False. It errs on an overlong representation of 'A'. If it doesn't, it's a
> bug.
Well, for one thing I was obviously speaking there of conversion rather than
validation. For conversion, you MIGHT want to be strict. Most, however, will
want to be as tolerant as they can of almost-right data. Better to keep the
conversion function flexible, and if you're worried, validate prior to
conversion with something like g_utf8_validate(). An overlong representation
of 'A' (I could imagine a sloppy UTF-16 program letting something like that
through) sitting in a data file shouldn't break your program, unless there's an
external contract stating that it should. If there is, g_utf8_validate() will
uphold that contract just fine prior to conversion.
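For reference, here is what "overlong" means in bytes, as a minimal sketch. In UTF-8 the lead bytes 0xC0 and 0xC1 can only begin overlong two-byte encodings of U+0000..U+007F, so a strict validator (g_utf8_validate() rejects such sequences as invalid) can refuse them on sight; the overlong 'A' being discussed would be 0xC1 0x81:

```c
/* Minimal sketch: a two-byte UTF-8 sequence decodes to
 * ((lead & 0x1F) << 6) | (cont & 0x3F).  Lead bytes 0xC0/0xC1 leave
 * at most 7 significant bits, i.e. a code point that already fits in
 * one byte, so any such sequence is overlong by definition. */
static int is_overlong_2byte(unsigned char lead, unsigned char cont)
{
    return (lead == 0xC0 || lead == 0xC1) && (cont & 0xC0) == 0x80;
}
```

A tolerant converter could decode 0xC1 0x81 to 'A' anyway; a strict one errors out, which is the disagreement in this thread.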
> You're totally missing the point. Allowing an alternate nul representation
> opens security problems much harder to track down than a crashing application.
> There's a lot of literature to read about this already. Not going to
> continue it here.
I've read a fair bit of such literature myself. Probably not quite so much as
you, so I'm more than happy to be enlightened. But real \0 NULLs also
introduce security problems that don't always crash a program. If one small
function in a program decides to treat a NULL as a special character, even by
mistake, then you've got a hard-to-find problem. Allowing alternate
representations is the problem; demanding exactly one specific alternate
representation is not.
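The classic embedded-NUL problem can be shown in a few lines: one layer tracks an explicit byte length while another treats the same bytes as a C string, and the two silently see different contents (the data here is purely illustrative):

```c
#include <string.h>

/* Sketch of the embedded-NUL mismatch: any code path that falls back
 * to C string semantics stops at the first '\0', while length-aware
 * code sees everything after it.  Two views, one buffer, silent
 * disagreement -- the kind that doesn't crash but does bite. */
static size_t cstring_view_len(const char *data)
{
    return strlen(data);   /* stops at the first '\0' */
}
```

Given the eleven-byte array "user\0admin", a length-aware layer sees 10 payload bytes while cstring_view_len() reports 4; whatever sits past the NUL is invisible to one half of the program. An internal encoding with exactly one, non-zero representation of NUL removes that split view.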
I'll even repeat that part a little differently: the critical point that every
single piece of literature I've read makes is to have ONE definition of each
character on the inside, and to properly control the entry and exit of that
data. It also does NOT try to dictate WHAT that definition should be, only
that it SHOULD avoid being something that has special magical meanings in
unexpected cases. NULL qualifies as magical.
If my interpretation is wrong, please DO continue. Because I've heard a few
others who appear to have the same interpretation of that literature as I do,
if not the same conclusion about how it should be handled.
Fredderic
_______________________________________________
gtk-devel-list mailing list
gtk-devel-list@gnome.org
http://mail.gnome.org/mailman/listinfo/gtk-devel-list