Irritating column numbers with encoding=utf-8
Hi,

with 'encoding' set to "utf-8" there is a quite confusing (to me) difference between the column number and my expectations (supported by the virtual column number) if there are non-ASCII characters on the line. I don't know what the intended meaning of "column count" and the intended behaviour of "cursor()" are, but it seems they both depend on the size of the encoded characters. I always thought "nth column" was more or less a synonym for "nth character on a line" while "nth virtual column" meant "nth cell on a screen line".

Here is how to reproduce the observed behaviour. Start

    vim -u NONE -U NONE

and

    :set encoding=utf-8
    :set laststatus=2
    :set statusline=[%c/%v]

(The last line tells VIM to display the column and the virtual column.) Now enter two lines

    abc
    äbc

(The first letter in the second line is a lower case "A" with umlaut.) While moving the cursor over the different characters on the first line the status line shows "[1/1]", "[2/2]", and "[3/3]", respectively, telling you that "column" and "virtual column" are equal. That is the expected behaviour as long as there are no special characters like tabs and non-printable characters.

Now move the cursor over the characters in the second line. While the cursor is over the "ä", "[1/1]" is displayed, but the next characters result in "[3/2]" and "[4/3]", respectively. It seems as if "ä" (or any non-ASCII character, for that matter) is accounting for (at least) two columns while encoding is set to "utf-8". Although I know that "ä" is represented by two bytes in UTF-8 encoding, I find this behaviour irritating because on the surface it's only one character. It gets even worse (IMHO) with characters that need three bytes in UTF-8 encoding, like LATIN CAPITAL LETTER A WITH DOT BELOW (0x1EA0), which increase the column number by three.

The "cursor()" function also shows this kind of interpretation of non-ASCII characters. Both

    call cursor(2, 1)

and

    call cursor(2, 2)

place the cursor on "ä". To place it on "b" you need to

    call cursor(2, 3)

although I would expect the second example to place the cursor on "b" already.

I can think of two ways to circumvent this problem:

  1) switching to "encoding=latin1", which is not always an option because of the need for characters outside the scope of latin1;

  2) using only virtual column numbers in the status line, but this gives different results when characters like tab or non-printables are displayed in more than one screen cell (which is of course reasonable).

I don't know whether the shown behaviour is a bug or just a feature I don't like, but in summary I think "column number" should really represent a character count (i.e., corresponding to what the user sees), not a byte count depending on the underlying encoding.

I have seen this behaviour in VIM 6.2, 6.3, 6.4, and 7.0, so changing the code will definitely introduce an incompatibility. So the final question is: What do you (Vimmers) and you (Bram) think: is there a need for a change?

Regards,
Jürgen

-- 
Jürgen Krämer                     Softwareentwicklung
HABEL GmbH & Co. KG               mailto:[EMAIL PROTECTED]
Hinteres Öschle 2                 Tel: +49 / 74 61 / 93 53 - 15
78604 Rietheim-Weilheim           Fax: +49 / 74 61 / 93 53 - 99
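The status line numbers reported above can be reproduced outside Vim by counting encoded bytes. The following is a minimal Python sketch, not Vim code; `byte_column` is a hypothetical helper introduced only for this illustration, and it assumes the "ä" is stored as the single precomposed code point U+00E4.

```python
def byte_column(line: str, char_index: int, encoding: str = "utf-8") -> int:
    """Return the 1-based byte column at which the character with the
    given 1-based character index starts (hypothetical helper)."""
    prefix = line[: char_index - 1]
    return len(prefix.encode(encoding)) + 1


line = "\u00e4bc"  # "äbc" with a precomposed a-umlaut
for index, char in enumerate(line, start=1):
    # Prints the byte column next to the character index, analogous to
    # the [%c/%v] pairs [1/1], [3/2], [4/3] reported in the message.
    print(f"{char}: [{byte_column(line, index)}/{index}]")

# U+1EA0 needs three bytes in UTF-8, so the next character starts at byte 4.
print(byte_column("\u1ea0b", 2))

# With a single-byte encoding such as latin1 both counts agree.
print(byte_column(line, 2, "latin-1"))
```

The character index only stands in for the virtual column here because the sample line has no tabs, control characters, or double-width characters.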
Re: Irritating column numbers with encoding=utf-8
On Wed, Jul 05, 2006 at 11:50:51AM +0200, Jürgen Krämer wrote:
> Hi,
>
> with 'encoding' set to "utf-8" there is a quite confusing (to me)
> difference between the column number and my expectations (supported by
> the virtual column number) if there are non-ASCII characters on the
> line.

Column number n is really the nth byte on that line. This is described at ":help /\%c". This description should explain all the behavior you're seeing. This is the intended behavior and I'm not sure of a way off-hand to get the visual character count like you want.

James
-- 
GPG Key: 1024D/61326D40 2003-09-02 James Vega <[EMAIL PROTECTED]>
Re: Irritating column numbers with encoding=utf-8
Hi,

James Vega wrote:
>
> On Wed, Jul 05, 2006 at 11:50:51AM +0200, Jürgen Krämer wrote:
>>
>> with 'encoding' set to "utf-8" there is a quite confusing (to me)
>> difference between the column number and my expectations (supported by
>> the virtual column number) if there are non-ASCII characters on the
>> line.
>
> Column number n is really the nth byte on that line. This is described
> at ":help /\%c". This description should explain all the behavior
> you're seeing. This is the intended behavior and I'm not sure of a way
> off-hand to get the visual character count like you want.

yes, it does *explain* the behaviour. But it makes things even worse. Suppose I have some lines with aligned data (just like a table) where I want to replace certain columns with dashes, e.g.,

    Peter    Traurig irgendwo  0
    Hänschen Klein   unterwegs 1
    Jürgen   Krämer  hier      2

    :%s/\%18c.*\%27c/-/

should strike out the third column of the table, but the result is

    Peter    Traurig - 0
    Hänschen Klein  -s 1
    Jürgen   Krämer-   2

The result depends on the arbitrary number of non-ASCII characters in front of the position used, characters whose internal representations should never be relevant for this substitution, because the user cannot know them.

Since it works as documented it is hard to call it a bug, but I would really consider it a mis-feature, because it works in such an unpredictable way.

Working around the problem in this example is not that hard -- I can use /\%...v instead. The example in my original mail poses a bigger problem (to me) -- I'd like to switch to "encoding=utf-8" as default, but I often need to work with text files of fixed line length. With encoding set to "latin1" the difference between column number and virtual column number in the status line is a visual clue that there is a tab or a control code in the line, reminding me to look for this character. With UTF-8 encoding this hint would be rendered useless because of all those little umlauts in German. :-(

But perhaps this is just my special problem.

Regards,
Jürgen

-- 
Jürgen Krämer                     Softwareentwicklung
HABEL GmbH & Co. KG               mailto:[EMAIL PROTECTED]
Hinteres Öschle 2                 Tel: +49 / 74 61 / 93 53 - 15
78604 Rietheim-Weilheim           Fax: +49 / 74 61 / 93 53 - 99
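The table substitution in the message above can be modelled by slicing the encoded bytes instead of the characters. This Python sketch is an illustration only, not how Vim implements `\%c`. The field widths in `make_row` are an assumption: the archived message has its whitespace collapsed, and these widths were chosen because they reproduce the three reported result lines. All function names are hypothetical.

```python
def make_row(first: str, last: str, place: str, number: int) -> str:
    # Assumed layout: first name in columns 1-9, last name in 10-17,
    # third field in 18-27, number in column 28.
    return f"{first:<9}{last:<8}{place:<10}{number}"


def replace_byte_columns(line, start, end, repl, encoding="utf-8"):
    """Replace the bytes from 1-based byte column start up to (not
    including) byte column end. Decoding may fail if a boundary falls
    inside a multibyte character; that does not happen for these rows."""
    data = line.encode(encoding)
    return (data[: start - 1] + repl.encode(encoding) + data[end - 1 :]).decode(encoding)


def replace_char_columns(line, start, end, repl):
    """Same replacement, but counting characters (code points)."""
    return line[: start - 1] + repl + line[end - 1 :]


rows = [
    make_row("Peter", "Traurig", "irgendwo", 0),
    make_row("H\u00e4nschen", "Klein", "unterwegs", 1),
    make_row("J\u00fcrgen", "Kr\u00e4mer", "hier", 2),
]

by_bytes = [replace_byte_columns(row, 18, 27, "-") for row in rows]
by_chars = [replace_char_columns(row, 18, 27, "-") for row in rows]

for line in by_bytes + by_chars:
    print(line)
```

Under these assumptions the byte-based version shifts left by one column per preceding umlaut, while the character-based version strikes out the same field in every row.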
Re: Irritating column numbers with encoding=utf-8
On 7/5/06, Jürgen Krämer <[EMAIL PROTECTED]> wrote:
> To work around the problem in this example is not that hard -- I can use
> /\%...v instead.

Yes

> The example in my original mail poses a bigger problem (to
> me) -- I'd like to switch to "encoding=utf-8" as default, but I often
> need to work with text files of fixed line length. With encoding set to
> "latin1" the difference between column number and virtual column number
> in the status line is a visual clue that there is a tabular or a control
> code in the line, reminding me to look for this character. With UTF-8
> encoding this hint would be rendered useless because of all those little
> umlauts in German. :-(

There's yet another reason for col() != virtcol(): unprintable characters like ^A ^@ ^[. Granted, they occur rarely in text files, but if they do, they'll cause virtcol() != col().

If you stick with virtcol() and \%v, you'll probably not feel any inconvenience.

I mean, there are two types of columns (virtual and non-virtual), and if someone confuses the two and uses \%c instead of \%v, or col() instead of virtcol(), or vice versa, it's inconvenient. Once the confusion is fixed and you use the right type of column index, doesn't that solve the inconvenience? (Except that there are still two types of columns, which requires increased attention as to which one to use in each case.)

Yakov
Re: Irritating column numbers with encoding=utf-8
On 7/5/06, Jürgen Krämer <[EMAIL PROTECTED]> wrote:
> with 'encoding' set to "utf-8" there is a quite confusing (to me)
> difference between the column number and my expectations (supported by
> the virtual column number) if there are non-ASCII characters on the
> line.

An additional remark. As James noted, \%c is not a character offset (in the case of multibyte chars), but a byte offset. In case you want to match neither by visual column (\%v) nor by byte offset, but by character index in the line, you can do this:

    /^.\{22}xyz

This matches xyz at the 23rd char position, counting each multibyte char and each single-byte char as 1 position.

Does this possibly solve your matching problem?

Yakov
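The same counting idea can be tried with Python's `re` module, where `.` on a `str` consumes one code point regardless of how many bytes it occupies. This is a sketch of the technique, not of Vim's regex engine; `matches_at_char_position` is a hypothetical helper.

```python
import re


def matches_at_char_position(line: str, position: int, literal: str) -> bool:
    """Return True if literal starts at the given 1-based character
    position, by skipping exactly position - 1 characters from the
    start of the line (the analogue of /^.\\{22}xyz for position 23)."""
    pattern = re.compile(r"^.{%d}%s" % (position - 1, re.escape(literal)))
    return pattern.match(line) is not None


# 22 two-byte characters followed by "xyz": character position 23,
# but byte column 45 in UTF-8.
line = "\u00e4" * 22 + "xyz"
print(matches_at_char_position(line, 23, "xyz"))
print(line.encode("utf-8").index(b"xyz") + 1)
```

The pattern counts characters by consuming them one at a time, so the cost grows with the position, which is usually negligible for ordinary line lengths.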
Re: Irritating column numbers with encoding=utf-8
Jürgen Krämer wrote:

> with 'encoding' set to "utf-8" there is a quite confusing (to me)
> difference between the column number and my expectations (supported by
> the virtual column number) if there are non-ASCII characters on the
> line. I don't know what the intended meaning of "column count" and the
> intended behaviour of "cursor()" are, but it seems they both depend on
> the size of the encoded characters. I always thought "nth column" was
> more or less a synonym for "nth character on a line" while "nth virtual
> column" meant "nth cell on a screen line".
>
> Here is how to reproduce the observed behaviour. Start
>
>    vim -u NONE -U NONE
>
> and
>
>    :set encoding=utf-8
>    :set laststatus=2
>    :set statusline=[%c/%v]
>
> (The last line tells VIM to display the column and the virtual column.)
> Now enter two lines
>
>    abc
>    äbc
>
> (The first letter in the second line is a lower case "A" with umlaut.)
> While moving the cursor over the different characters on the first line
> the status line shows "[1/1]", "[2/2]", and "[3/3]", respectively,
> telling you that "column" and "virtual column" are equal. That is the
> expected behaviour as long as there are no special characters like tabs
> and non-printable characters.
>
> Now move the cursor over the characters in the second line. While the
> cursor is over the "ä" "[1/1]" is displayed, but the next characters
> result in "[3/2]" and "[4/3]", respectively. It seems as if "ä" (or any
> non-ASCII character, for that matter) is accounting for (at least) two
> columns while encoding is set to "utf-8". Although I know that "ä" is
> represented by two bytes in UTF-8 encoding, I find this behaviour
> irritating because on the surface it's only one character. It even gets
> worse (IMHO) with characters that need three bytes in UTF-8 encoding,
> like LATIN CAPITAL LETTER A WITH DOT BELOW (0x1EA0), which increase the
> column number by three.
>
> Also the "cursor()" function shows this kind of interpretation of
> non-ASCII characters. Both
>
>    call cursor(2, 1)
>
> and
>
>    call cursor(2, 2)
>
> place the cursor on "ä". To place it on "b" you need to
>
>    call cursor(2, 3)
>
> although I would expect that already the second example would place the
> cursor on "b".
>
> I can think of two ways to circumvent this problem:
>
>   1) switching to "encoding=latin1", which is not always an option
>      because of the need for characters outside the scope of latin1;
>
>   2) using only virtual column numbers in the status line, but this
>      gives different results when characters like tab or non-printables
>      are displayed in more than one screen cell (which is of course
>      reasonable).
>
> I don't know whether the shown behaviour is a bug or just a feature I
> don't like, but in summary I think "column number" should really
> represent a character count (i.e, corresponding to what the user sees),
> not a byte count depending on the underlying encoding.
>
> I have seen this behaviour in VIM 6.2, 6.3, 6.4, and 7.0, so changing
> the code will definitely introduce an incompatibility. So the final
> question is: What do you (Vimmers) and you (Bram) think: is there a need
> for a change.

I don't know why you call this a column count, in most places it's called a byte count. Perhaps in some places in the docs the remark about this actually being a byte count is missing.

You could also want a character count. But what is a character when using composing characters? E.g., when the umlaut is not included in a character but added as a separate composing character?

It's not so obvious what to do. In these situations I rather keep it as it is.

-- 
DENNIS: Look, strange women lying on their backs in ponds handing out
        swords ... that's no basis for a system of government.  Supreme
        executive power derives from a mandate from the masses, not from some
        farcical aquatic ceremony.
                 "Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD

 /// Bram Moolenaar -- [EMAIL PROTECTED] -- http://www.Moolenaar.net   \\\
///        sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\        download, build and distribute -- http://www.A-A-P.org        ///
 \\\            help me help AIDS victims -- http://ICCF-Holland.org    ///
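The composing-character question raised in the reply above can be made concrete with Python's standard `unicodedata` module. This sketch compares the precomposed and decomposed forms of "äbc"; `describe` is a hypothetical helper, and counting code points whose combining class is zero is only a rough stand-in for "what the user sees", not full grapheme segmentation.

```python
import unicodedata


def describe(text: str) -> dict:
    """Return three different 'lengths' of the same text."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # Code points that are not combining marks (approximation only).
        "base_characters": sum(1 for c in text if unicodedata.combining(c) == 0),
    }


composed = "\u00e4bc"                                # a-umlaut as one code point
decomposed = unicodedata.normalize("NFD", composed)  # "a" + U+0308 + "bc"

print(describe(composed))    # 3 code points, 4 bytes, 3 base characters
print(describe(decomposed))  # 4 code points, 5 bytes, 3 base characters
```

Both strings render as the same three letters, yet neither the byte count nor the code point count agrees between them, which is the ambiguity the reply points at.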
Re: Irritating column numbers with encoding=utf-8
Hi,

Bram Moolenaar wrote:
>
> Jürgen Krämer wrote:
>
>> with 'encoding' set to "utf-8" there is a quite confusing (to me)
>> difference between the column number and my expectations (supported by
>> the virtual column number) if there are non-ASCII characters on the
>> line. I don't know what the intended meaning of "column count" and the
>> intended behaviour of "cursor()" are, but it seems they both depend on
>> the size of the encoded characters. I always thought "nth column" was
>> more or less a synonym for "nth character on a line" while "nth virtual
>> column" meant "nth cell on a screen line".
>>
[snipped]
>>
>> I don't know whether the shown behaviour is a bug or just a feature I
>> don't like, but in summary I think "column number" should really
>> represent a character count (i.e, corresponding to what the user sees),
>> not a byte count depending on the underlying encoding.
>>
>> I have seen this behaviour in VIM 6.2, 6.3, 6.4, and 7.0, so changing
>> the code will definitely introduce an incompatibility. So the final
>> question is: What do you (Vimmers) and you (Bram) think: is there a need
>> for a change.
>
> I don't know why you call this a column count, in most places it's
> called a byte count. Perhaps in some places in the docs the remark
> about this actually being a byte count is missing.

sorry, the "column count" in the first paragraph should have been a "column number". I called it so because I have the statusline option set to

    %<%f%= [%1*%M%*%{','.&fileformat}%R%Y] [%6l,%4c%V] %3b=0x%02B %P

and noticed that "%4c-%V" displayed two numbers instead of the one I expected, because I knew there were no tabs or unprintable characters on that line. Even more disturbing was the fact that the first number (the column number) was bigger than the second one (the virtual column number). So I checked ":help statusline" and it told me

    c N   Column number.
    v N   Virtual column number.
    V N   Virtual column number as -{num}.  Not displayed if equal to 'c'.

> You could also want a character count. But what is a character when
> using composing characters? E.g., when the umlaut is not included in
> a character but added as a separate composing character?

I would say that a character is what the user sees. Why should he (be forced to) know whether "ä" is represented internally as LATIN SMALL LETTER A WITH DIAERESIS or as LATIN SMALL LETTER A plus COMBINING DIAERESIS? So in my opinion "column count" is equivalent to "character count" unless there are characters like tabs and unprintable ones that have a special representation -- on the screen, not internally.

> It's not so obvious what to do. In these situations I rather keep it as
> it is.

I know it's a big change and would introduce incompatibility with older versions, but here is another example: Take this line (ignoring the leading spaces)

    ääbbcc

and the following commands

    :s/\%3c../xx/
    %s/^..\zs../xx/

From my point of view they should both replace the 3rd and 4th column with "xx". When encoding is set to latin1 they do, but not when it is set to utf-8 -- the first one replaces "äb" with "xx". As a user I would be really stumped and ask "Why that, it's the same text as before."

Changing these commands to

    :s/\%2c../xx/
    %s/^.\zs../xx/

makes things even more irritating. The second one works as expected, now correctly replacing "äb" with "xx", but the first one fails with "E486: Pattern not found: \%2c..". Again: Ought I (as a user) really need to know that \%2c depends on the number of non-ASCII letters in front of the column I'm interested in?

Regards,
Jürgen

-- 
Jürgen Krämer                     Softwareentwicklung
HABEL GmbH & Co. KG               mailto:[EMAIL PROTECTED]
Hinteres Öschle 2                 Tel: +49 / 74 61 / 93 53 - 15
78604 Rietheim-Weilheim           Fax: +49 / 74 61 / 93 53 - 99
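Why the byte column 3 lands on the second "ä" and byte column 2 matches nothing can be seen by listing the byte column at which each character of "ääbbcc" starts. This Python sketch only models the arithmetic and is not Vim's matching code; both helpers are hypothetical, and the "ä" characters are assumed to be precomposed (U+00E4).

```python
def char_start_columns(line: str, encoding: str = "utf-8") -> list:
    """Return the 1-based byte column at which each character starts."""
    columns = []
    column = 1
    for char in line:
        columns.append(column)
        column += len(char.encode(encoding))
    return columns


def two_chars_at_byte_column(line: str, column: int, encoding: str = "utf-8"):
    """Return the two characters starting at the given byte column, or
    None if no character starts there or fewer than two remain."""
    starts = char_start_columns(line, encoding)
    if column not in starts:
        return None
    index = starts.index(column)
    found = line[index : index + 2]
    return found if len(found) == 2 else None


line = "\u00e4\u00e4bbcc"  # "ääbbcc"
print(char_start_columns(line))                    # [1, 3, 5, 6, 7, 8]
print(two_chars_at_byte_column(line, 3))           # second a-umlaut plus "b"
print(two_chars_at_byte_column(line, 2))           # None: inside the first character
print(two_chars_at_byte_column(line, 3, "latin-1"))  # "bb"
```

No character starts at byte column 2 in the UTF-8 case, which is consistent with the "Pattern not found" error reported for `\%2c..`.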
RE: Irritating column numbers with encoding=utf-8
> -----Original Message-----
> From: Jürgen Krämer [mailto:[EMAIL PROTECTED]
> Sent: 06 July 2006 08:01
> To: vim mailing list
> Subject: Re: Irritating column numbers with encoding=utf-8
>
> Hi,
>
> Bram Moolenaar wrote:
> >
> > Jürgen Krämer wrote:
> >
> >> with 'encoding' set to "utf-8" there is a quite confusing (to me)
> >> difference between the column number and my expectations (supported by
> >> the virtual column number) if there are non-ASCII characters on the
> >> line. I don't know what the intended meaning of "column count" and the
> >> intended behaviour of "cursor()" are, but it seems they both depend on
> >> the size of the encoded characters. I always thought "nth column" was
> >> more or less a synonym for "nth character on a line" while "nth virtual
> >> column" meant "nth cell on a screen line".
> >>
> [snipped]
> >>
> >> I don't know whether the shown behaviour is a bug or just a feature I
> >> don't like, but in summary I think "column number" should really
> >> represent a character count (i.e, corresponding to what the user sees),
> >> not a byte count depending on the underlying encoding.
> >>
> >> I have seen this behaviour in VIM 6.2, 6.3, 6.4, and 7.0, so changing
> >> the code will definitely introduce an incompatibility. So the final
> >> question is: What do you (Vimmers) and you (Bram) think: is there a need
> >> for a change.
> >
> > I don't know why you call this a column count, in most places it's
> > called a byte count. Perhaps in some places in the docs the remark
> > about this actually being a byte count is missing.
>
> sorry, the "column count" in the first paragraph should have been a
> "column number". I called it so because I have the statusline option set
> to
>
>    %<%f%= [%1*%M%*%{','.&fileformat}%R%Y] [%6l,%4c%V] %3b=0x%02B %P
>
> and noticed that "%4c-%V" displayed two numbers instead of the one I
> expected, because I knew there were no tabs or unprintable characters
> on that line. Even more disturbing was the fact that the first number
> (the column number) was bigger than the second one (the virtual column
> number). So I checked ":help statusline" and it told me
>
>    c N   Column number.
>    v N   Virtual column number.
>    V N   Virtual column number as -{num}.  Not displayed if equal to 'c'.
>
> > You could also want a character count. But what is a character when
> > using composing characters? E.g., when the umlaut is not included in
> > a character but added as a separate composing character?
>
> I would say that a character is what the user sees. Why should he (be
> forced to) know wheter "ä" is represented internally as LATIN SMALL
> LETTER A WITH DIAERESIS or as LATIN SMALL LETTER A plus COMBINING
> DIARESIS? So in my opinion "column count" is equivalent to "character
> count" unless there are characters like tabs and unprintable ones that
> have a special representation -- on the screen, not internally.
>
> > It's not so obvious what to do. In these situations I rather keep it as
> > it is.
>
> I know it's a big change and would introduce imcompatibiliy with older
> versions, but here is another example: Take this line (ignoring the
> leading spaces)
>
>    ääbbcc
>
> and the following commands
>
>    :s/\%3c../xx/
>    %s/^..\zs../xx/
>
> From my point of view they should both replace the 3rd and 4th column
> with "xx". When encoding is set to latin1 they do, but not when it is
> set to utf-8 -- the first one replaces "äb" with "xx". As a user I would
> be really stumbled and ask "Why that, it's the same text as before."
>
> Changing these commands to
>
>    :s/\%2c../xx/
>    %s/^.\zs../xx/
>
> makes things even more irritating. The second one works as expected, now
> correctly replacing "äb" with "xx", but the first one fails with "E486:
> Pattern not found: \%2c..". Again: Ought I (as a user) really need to
> know that \%2c depends on the number of non-ASCII letters in front of
> the column I'm interested in?

Yes, this is indeed very unexpected IMHO and, as you say, mighty irritating. I find it very hard to disagree with your arguments. This should be changed IMHO, even if it surely is a big change.

---Zdenek