On Sat, Apr 07, 2007 at 01:46:22PM +0200, Marcin 'Qrczak' Kowalczyk wrote:
> For example in my language Kogut a string is a sequence of Unicode code
> points. My implementation uses two string representations internally:
> if it contains no characters above U+00FF, then it’s stored as a
> sequence of bytes, otherwise it’s a sequence of 32-bit integers.

> This variation is not visible in the language. The narrow case has
> a redundant NUL appended. When a string is passed to some C function
> and the function expects the default encoding (normally taken from
> the locale), then — under the assumption that a default encoding
> is ASCII-compatible — if the string contains only ASCII characters
> excluding NUL, a pointer to the string data is passed. Otherwise

I hope you generate an exception (or whatever the appropriate error
behavior is) if the string contains a NUL byte other than the
terminator when it's passed to C functions. Otherwise you risk the
same sort of vuln that burned Firefox. Passing a string with embedded
NULs where a NUL-terminated string is expected is an incompatible type
error, and a self-respecting HLL should catch and protect you from
this.
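One way such a check might look in C (the function name and calling convention here are made up for illustration; a real runtime would do this at its FFI boundary):

```c
#include <stddef.h>
#include <string.h>

/* Sketch: reject strings with embedded NULs before handing them to a
 * C API that expects NUL termination.  `len` is the string's logical
 * length as tracked by the HLL runtime; the redundant terminator sits
 * at data[len] and is not scanned. */
int safe_for_c_api(const char *data, size_t len)
{
    /* memchr finds any NUL *inside* the logical string */
    return memchr(data, '\0', len) == NULL;
}
```

The runtime would raise its exception (or equivalent) whenever this returns 0, instead of silently truncating at the first NUL.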

> a recoded array of bytes is created. This is quite a practical reason
> to store the redundant NULs, even though NUL is not special as far as
> the string type is concerned. Most strings manipulated by average
> programs are ASCII-only.

Using UTF-8 would have accomplished the same thing without
special-casing. Then even non-ASCII strings would use less memory. As
discussed recently on this list, most if not all of the advantages of
UTF-32 over UTF-8 are mythical.
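For concreteness, the per-code-point storage cost of UTF-8 versus UTF-32's fixed four bytes is easy to tabulate (a sketch; the function name is mine):

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one Unicode code point in UTF-8.
 * UTF-32 always spends 4 bytes; UTF-8 spends 1 for ASCII,
 * 2 up through U+07FF (covering the U+00FF narrow case above),
 * 3 through U+FFFF, and 4 beyond. */
size_t utf8_len(uint32_t cp)
{
    if (cp < 0x80)     return 1;
    if (cp < 0x800)    return 2;
    if (cp < 0x10000)  return 3;
    return 4;
}
```

So ASCII-heavy text, which the quoted message says is the common case, costs a quarter of the UTF-32 size with no second representation needed.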

> > Also note that there's nothing "backwards" about using termination
> > instead of length+data. For example it's the natural way a string
> > would be represented in a pure (without special string type) lisp-like
> > language. (Of course using a list is still binary clean because the
> > terminator is in the cdr rather than the car.)
> 
> The parenthesized remark is crucial. Lisp lists use an out-of-band
> terminator, not in-band.

Indeed, the point was more about the O(n) thing not being a problem.
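To make the out-of-band point concrete, a cons-cell string might look like this in C (names are illustrative): the terminator is the NULL next pointer in the cdr, so a 0 character in the car is ordinary data, and length is O(n) by walking the list.

```c
#include <stddef.h>

/* One character per cell; the list ends when `next` is NULL.
 * A character value of 0 is just data - binary clean. */
struct cons { unsigned ch; struct cons *next; };

/* O(n) length, analogous to strlen, but the terminator is
 * out-of-band (the NULL cdr), not in-band (a NUL char). */
size_t list_len(const struct cons *s)
{
    size_t n = 0;
    for (; s != NULL; s = s->next)
        n++;
    return n;
}
```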

> > And like with lists, C
> > strings have the advantage that a terminal substring of the original
> > string is already a string in-place, without copying.
> 
> This is too small an advantage to overcome the inability to store
> NULs and the lack of an O(1) length check (which rules out bounds
> checking on indexing), and it’s impractical with garbage collection
> anyway.

C strings are usually used for small strings, for which O(n) is
effectively O(1) because n is bounded by, say, 4096 (PATH_MAX). Whenever discussing
these issues, it's essential to be aware that a plain string, whether
NUL-terminated or pascal-style, is unsuitable for a large class of
uses including any data that will frequently be edited. This is
because insertion or deletion is O(n). A reasonable program working
with large textual datasets will keep small strings (maybe lines, or
maybe arbitrary chunks of a certain max size) in a list structure with
a higher level data structure indexing them.
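A rough sketch of that chunked approach in C (CHUNK size and all names are made up for illustration; a real editor would add the indexing structure on top):

```c
#include <stdlib.h>
#include <string.h>

#define CHUNK 64  /* illustrative max chunk size */

/* Text as a linked list of small chunks: an insertion shifts at
 * most CHUNK bytes instead of the whole n-byte buffer. */
struct chunk {
    size_t len;
    char data[CHUNK];
    struct chunk *next;
};

/* Insert one byte at offset `at` within chunk `c`, splitting the
 * chunk in half first if it is full. */
void chunk_insert(struct chunk *c, size_t at, char byte)
{
    if (c->len == CHUNK) {
        struct chunk *half = malloc(sizeof *half);
        half->len = CHUNK / 2;
        memcpy(half->data, c->data + CHUNK / 2, CHUNK / 2);
        half->next = c->next;
        c->len = CHUNK / 2;
        c->next = half;
        if (at > c->len) {            /* insertion lands in new half */
            chunk_insert(half, at - c->len, byte);
            return;
        }
    }
    memmove(c->data + at + 1, c->data + at, c->len - at);
    c->data[at] = byte;
    c->len++;
}
```

The cost per edit is bounded by the chunk size, which is the whole point: n drops out of the insertion cost entirely.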

Certainly, very high level languages could do the same with ordinary
strings, providing efficient primitives for insertion, deletion,
splitting, etc., but for the majority of tiny strings the overhead may
be a net loss. I kinda prefer the Emacs Lisp approach of having
immutable string objects for small jobs and full-fledged Emacs buffers
for heavyweight text processing.

At this point I'm not sure to what degree this thread is
off-/on-topic. If any list members are offended by its continuation,
please say so.

~Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
