Hi,

On Mon, Apr 04, 2005 at 11:35:44AM +0200, Roland Illig wrote:
> * the _size_ of a string (as well as for other objects) is the number of
>   bytes that is allocated for it. For arrays, it is the number of
>   entries of the array. For strings it is at least _length_ + 1.
>
> * the _length_ of a string is the number of characters in it, excluding
>   the terminating '\0'.
>
> * the _width_ and _height_ of a string are the size of a box on the
>   screen that would be needed to display the string.

It seems to me that this terminology is not yet multibyte-aware. Since
UTF-8 is becoming an everyday issue and AFAIR is planned for mainstream
mc 4.7.0, IMHO it is very important to create a clear terminology for
this, even if it is not yet officially implemented. Hence:

Byte and character are two completely different notions. What a byte
means is clear. A character is a human-visible entity, e.g. an accented
letter, and may be represented by one or more bytes. It should be
clarified whether a combining symbol (e.g. one that puts an accent on
top of the previous letter) counts as a character on its own or not.
Pressing a letter on the keyboard usually inserts one character, and a
backspace/delete is supposed to remove one character, not one byte.

Is the _length_ of a string the number of bytes in it, or the number of
characters in it? If it is the number of bytes, then the second
definition (in the quoted part) should be corrected. If it is the number
of characters, then the last sentence of the first definition loses its
meaning: size and length then have really nothing to do with each other,
and the size >= length + 1 constraint is misleading (even though it is
not false, assuming that every character takes at least one byte to
represent).

Actually, what does "string" mean? Is it an arbitrary sequence of bytes
terminated by the first zero byte, which we sometimes try to display
somehow, or is it a technical representation of a human-readable text?
These two approaches may lead to completely different programming
philosophies. I recommend the latter, since it thinks in the terms that
matter most for the user interface: the meaning of the byte sequence
rather than the raw byte sequence itself. Another consequence is that
under the second definition the byte sequence must always be valid
according to one well-defined character set (e.g. valid UTF-8), while
the first version also allows invalid byte sequences that should still
be displayed somehow.

Furthermore, it should be emphasized that the width of a character is
not necessarily 1, so the number of bytes, the number of characters and
the width of a string may be three completely different values.

-- 
Egmont

_______________________________________________
Mc-devel mailing list
http://mail.gnome.org/mailman/listinfo/mc-devel