Irritating column numbers with encoding=utf-8

2006-07-05 Thread Jürgen Krämer

Hi,

with 'encoding' set to "utf-8" there is a quite confusing (to me)
difference between the column number and my expectations (supported by
the virtual column number) if there are non-ASCII characters on the
line. I don't know what the intended meaning of "column count" and the
intended behaviour of "cursor()" are, but it seems they both depend on
the size of the encoded characters. I always thought "nth column" was
more or less a synonym for "nth character on a line" while "nth virtual
column" meant "nth cell on a screen line".

Here is how to reproduce the observed behaviour. Start

   vim -u NONE -U NONE

and

  :set encoding=utf-8
  :set laststatus=2
  :set statusline=[%c/%v]

(The last line tells VIM to display the column and the virtual column.)
Now enter two lines

  abc
  äbc

(The first letter in the second line is a lower case "A" with umlaut.)
While moving the cursor over the different characters on the first line
the status line shows "[1/1]", "[2/2]", and "[3/3]", respectively,
telling you that "column" and "virtual column" are equal. That is the
expected behaviour as long as there are no special characters like tabs
and non-printable characters.

Now move the cursor over the characters in the second line. While the
cursor is over the "ä", "[1/1]" is displayed, but the next characters
result in "[3/2]" and "[4/3]", respectively. It seems as if "ä" (or any
non-ASCII character, for that matter) is accounting for (at least) two
columns while encoding is set to "utf-8". Although I know that "ä" is
represented by two bytes in UTF-8 encoding, I find this behaviour
irritating because on the surface it's only one character. It even gets
worse (IMHO) with characters that need three bytes in UTF-8 encoding,
like LATIN CAPITAL LETTER A WITH DOT BELOW (0x1EA0), which increase the
column number by three.
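
The byte-versus-character arithmetic behind these numbers can be checked
outside Vim. The following is a minimal Python sketch, assuming UTF-8 and
one screen cell per character; `columns` is a hypothetical helper used only
for illustration, not a Vim function:

```python
# Sketch: 1-based byte column (what %c reports, per the observation above)
# versus 1-based character column for each character of a line.
def columns(line, encoding="utf-8"):
    """Return (char, byte_col, char_col) triples for each character in line."""
    result = []
    byte_col = 1
    for char_col, ch in enumerate(line, start=1):
        result.append((ch, byte_col, char_col))
        # The next character starts after all bytes of this one.
        byte_col += len(ch.encode(encoding))
    return result


# "\u00e4" is the precomposed a-umlaut, so the line is "äbc".
for ch, byte_col, char_col in columns("\u00e4bc"):
    print(f"{ch}: [{byte_col}/{char_col}]")
```

For the line "äbc" this prints [1/1], [3/2] and [4/3], the same pairs the
status line shows. With a three-byte character such as U+1EA0 in front, the
"b" would land on byte column 4 instead.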

Also the "cursor()" function shows this kind of interpretation of
non-ASCII characters. Both

  call cursor(2, 1)

and

  call cursor(2, 2)

place the cursor on "ä". To place it on "b" you need to

  call cursor(2, 3)

although I would expect that already the second example would place the
cursor on "b".
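
The cursor() behaviour described here is consistent with the column
argument being a byte index that is resolved to whichever character covers
that byte. A rough Python model of that reading (an assumption drawn from
the observations above, not taken from Vim's source; `char_under_byte_col`
is a hypothetical name):

```python
def char_under_byte_col(line, col, encoding="utf-8"):
    """Return the character whose encoded bytes cover 1-based byte column col."""
    start = 1
    for ch in line:
        width = len(ch.encode(encoding))
        if start <= col < start + width:
            return ch
        start += width
    return None  # col lies past the end of the line


line = "\u00e4bc"  # "äbc"
for col in (1, 2, 3):
    print(col, char_under_byte_col(line, col))
```

Columns 1 and 2 both resolve to "ä" and column 3 resolves to "b", which
mirrors the three cursor() calls shown above.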

I can think of two ways to circumvent this problem:

  1) switching to "encoding=latin1", which is not always an option
 because of the need for characters outside the scope of latin1;

  2) using only virtual column numbers in the status line, but this
 gives different results when characters like tab or non-printables
 are displayed in more than one screen cell (which is of course
 reasonable).

I don't know whether the shown behaviour is a bug or just a feature I
don't like, but in summary I think "column number" should really
represent a character count (i.e., corresponding to what the user sees),
not a byte count depending on the underlying encoding.

I have seen this behaviour in VIM 6.2, 6.3, 6.4, and 7.0, so changing
the code will definitely introduce an incompatibility. So the final
question is: What do you (Vimmers) and you (Bram) think: is there a need
for a change?

Regards,
Jürgen

-- 
Jürgen Krämer              Softwareentwicklung
HABEL GmbH & Co. KG        mailto:[EMAIL PROTECTED]
Hinteres Öschle 2          Tel: +49 / 74 61 / 93 53 - 15
78604 Rietheim-Weilheim    Fax: +49 / 74 61 / 93 53 - 99


Re: Irritating column numbers with encoding=utf-8

2006-07-05 Thread James Vega
On Wed, Jul 05, 2006 at 11:50:51AM +0200, Jürgen Krämer wrote:
> 
> Hi,
> 
> with 'encoding' set to "utf-8" there is a quite confusing (to me)
> difference between the column number and my expectations (supported by
> the virtual column number) if there are non-ASCII characters on the
> line.

Column number n is really the nth byte on that line.  This is described
at ":help /\%c".  This description should explain all the behavior
you're seeing.  This is the intended behavior and I'm not sure of a way
off-hand to get the visual character count like you want.

James
-- 
GPG Key: 1024D/61326D40 2003-09-02 James Vega <[EMAIL PROTECTED]>


Re: Irritating column numbers with encoding=utf-8

2006-07-05 Thread Jürgen Krämer

Hi,

James Vega wrote:
>
> On Wed, Jul 05, 2006 at 11:50:51AM +0200, Jürgen Krämer wrote:
>>
>> with 'encoding' set to "utf-8" there is a quite confusing (to me)
>> difference between the column number and my expectations (supported by
>> the virtual column number) if there are non-ASCII characters on the
>> line.
>
> Column number n is really the nth byte on that line.  This is described
> at ":help /\%c".  This description should explain all the behavior
> you're seeing.  This is the intended behavior and I'm not sure of a way
> off-hand to get the visual character count like you want.

yes, it does *explain* the behaviour. But it makes things even worse.
Suppose I have some lines with aligned data (just like a table) where I
want to replace certain columns with dashes, e.g.,

  Peter    Traurig irgendwo  0
  Hänschen Klein   unterwegs 1
  Jürgen   Krämer  hier      2

  :%s/\%18c.*\%27c/-/

should strike out the third column of the table, but the result is

  Peter    Traurig - 0
  Hänschen Klein  -s 1
  Jürgen   Krämer-   2

which depends on the arbitrary number of non-ASCII characters in front
of the position used, characters whose internal representations should
never be relevant for this substitution, because the user cannot know
them.
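
The shifted results can be reproduced with plain byte slicing. The sketch
below is Python rather than Vim script and assumes the table is aligned so
that the third field starts at character column 18; `sub_byte_cols` and
`sub_char_cols` are hypothetical helpers that only imitate the effect of
the substitution on these three lines:

```python
def sub_byte_cols(line, start_col, end_col, repl, encoding="utf-8"):
    """Replace the bytes in 1-based columns [start_col, end_col) with repl.

    Assumes both columns fall on character boundaries, as they do in the
    example rows; otherwise decode() raises UnicodeDecodeError.
    """
    raw = line.encode(encoding)
    new = raw[:start_col - 1] + repl.encode(encoding) + raw[end_col - 1:]
    return new.decode(encoding)


def sub_char_cols(line, start_col, end_col, repl):
    """Same replacement, but counting characters instead of bytes."""
    return line[:start_col - 1] + repl + line[end_col - 1:]


rows = [
    "Peter    Traurig irgendwo  0",
    "H\u00e4nschen Klein   unterwegs 1",     # one two-byte character
    "J\u00fcrgen   Kr\u00e4mer  hier      2",  # two two-byte characters
]

for row in rows:
    print(sub_byte_cols(row, 18, 27, "-"))   # what the byte-based \%c does
for row in rows:
    print(sub_char_cols(row, 18, 27, "-"))   # what was intended
```

Each two-byte character in front of the target shifts the byte-based
replacement one character to the left, which is why the three rows come out
differently, while the character-based version treats all rows alike.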

Since it works as documented it is hard to call it a bug, but I would
really consider it a mis-feature, because it works in such a
non-predictable way.

To work around the problem in this example is not that hard -- I can use
/\%...v instead. The example in my original mail poses a bigger problem
(to me) -- I'd like to switch to "encoding=utf-8" as default, but I
often need to work with text files of fixed line length. With encoding
set to "latin1" the difference between column number and virtual column
number in the status line is a visual clue that there is a tab or a
control code in the line, reminding me to look for this character. With
UTF-8 encoding this hint would be rendered useless because of all those
little umlauts in German. :-(

But perhaps this is just my special problem.

Regards,
Jürgen




Re: Irritating column numbers with encoding=utf-8

2006-07-05 Thread Yakov Lerner

On 7/5/06, Jürgen Krämer <[EMAIL PROTECTED]> wrote:

> To work around the problem in this example is not that hard -- I can use
> /\%...v instead.

Yes

> The example in my original mail poses a bigger problem
> (to me) -- I'd like to switch to "encoding=utf-8" as default, but I
> often need to work with text files of fixed line length. With encoding
> set to "latin1" the difference between column number and virtual column
> number in the status line is a visual clue that there is a tab or a
> control code in the line, reminding me to look for this character. With
> UTF-8 encoding this hint would be rendered useless because of all those
> little umlauts in German. :-(


There's yet another reason for col()!=virtcol().

It's unprintable characters like ^A ^@ ^[
Granted, they occur rarely in textfiles, but if they do,
they'll cause virtcol() != col().

If you stick with virtcol() and \%v, you'll
probably not feel any inconvenience. I mean, there are two types
of columns (virtual and non-virtual), and if someone
confuses the two, and uses \%c instead of \%v or col() instead of
virtcol(), or vice versa, it's inconvenient.

Once the confusion is fixed, and you use the right type
of column index, doesn't that solve the inconvenience?
(Except that there are still two types of columns, which
requires increased attention as to which one
to use in each case.)
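
To make the two kinds of column concrete, here is a small Python sketch
that compares the byte length of a line with a rough estimate of its
on-screen width. It assumes a tabstop of 8, control characters drawn as two
cells (^A, ^[), and every other character taking one cell; wide characters
and display options are ignored. `display_width` is a hypothetical helper,
not a Vim function:

```python
def display_width(line, tabstop=8):
    """Estimate the number of screen cells a line occupies."""
    width = 0
    for ch in line:
        if ch == "\t":
            # A tab advances to the next multiple of tabstop.
            width += tabstop - (width % tabstop)
        elif ord(ch) < 0x20 or ord(ch) == 0x7F:
            width += 2  # control characters shown in caret form, e.g. ^A
        else:
            width += 1  # assumption: single-cell character
    return width


for text in ("abc", "a\tb", "a\x01b", "\u00e4bc"):
    byte_len = len(text.encode("utf-8"))
    print(repr(text), "bytes:", byte_len, "cells:", display_width(text))
```

Tabs and control characters make the cell count larger than the byte
count, whereas a multibyte character such as "ä" makes the byte count
larger than the cell count.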

Yakov


Re: Irritating column numbers with encoding=utf-8

2006-07-05 Thread Yakov Lerner

On 7/5/06, Jürgen Krämer <[EMAIL PROTECTED]> wrote:

> with 'encoding' set to "utf-8" there is a quite confusing (to me)
> difference between the column number and my expectations (supported by
> the virtual column number) if there are non-ASCII characters on the
> line.


An additional remark. As James noted, \%c
is not a character offset (in the case of multibyte chars),
but a byte offset.

In case you want to match
not by visual column (\%v) and not by byte
offset, but by character index in the line, you
can do this:

/^.\{22}xyz

This matches xyz at the 23rd char position,
correctly counting each multibyte char and
each single-byte char as 1 position. Does this
possibly solve your matching problem?
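
The same idea can be tried in Python, where `.` in a pattern over text
counts code points and `.` in a pattern over bytes counts bytes. This is
only an analogy to the Vim pattern above, and both function names are
hypothetical:

```python
import re


def matches_at_char_index(line, n, text):
    """True if text follows exactly n characters from the start of line."""
    pattern = r".{%d}%s" % (n, re.escape(text))
    return re.match(pattern, line) is not None


def matches_at_byte_index(line, n, text, encoding="utf-8"):
    """The same test on the encoded line, counting bytes instead."""
    pattern = b".{%d}%s" % (n, re.escape(text.encode(encoding)))
    return re.match(pattern, line.encode(encoding)) is not None


line = "\u00e4" * 22 + "xyz"  # 22 two-byte characters, then "xyz"
print(matches_at_char_index(line, 22, "xyz"))  # counted in characters
print(matches_at_byte_index(line, 22, "xyz"))  # counted in bytes
print(matches_at_byte_index(line, 44, "xyz"))  # 22 characters = 44 bytes
```

Counting characters finds "xyz" after 22 positions regardless of how many
bytes each character needs; counting bytes finds it only at offset 44.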

Yakov


Re: Irritating column numbers with encoding=utf-8

2006-07-05 Thread Bram Moolenaar

Jürgen Krämer wrote:

> with 'encoding' set to "utf-8" there is a quite confusing (to me)
> difference between the column number and my expectations (supported by
> the virtual column number) if there are non-ASCII characters on the
> line. I don't know what the intended meaning of "column count" and the
> intended behaviour of "cursor()" are, but it seems they both depend on
> the size of the encoded characters. I always thought "nth column" was
> more or less a synonym for "nth character on a line" while "nth virtual
> column" meant "nth cell on a screen line".
> 
> Here is how to reproduce the observed behaviour. Start
> 
>    vim -u NONE -U NONE
> 
> and
> 
>   :set encoding=utf-8
>   :set laststatus=2
>   :set statusline=[%c/%v]
> 
> (The last line tells VIM to display the column and the virtual column.)
> Now enter two lines
> 
>   abc
>   äbc
> 
> (The first letter in the second line is a lower case "A" with umlaut.)
> While moving the cursor over the different characters on the first line
> the status line shows "[1/1]", "[2/2]", and "[3/3]", respectively,
> telling you that "column" and "virtual column" are equal. That is the
> expected behaviour as long as there are no special characters like tabs
> and non-printable characters.
> 
> Now move the cursor over the characters in the second line. While the
> cursor is over the "ä" "[1/1]" is displayed, but the next characters
> result in "[3/2]" and "[4/3]", respectively. It seems as if "ä" (or any
> non-ASCII character, for that matter) is accounting for (at least) two
> columns while encoding is set to "utf-8". Although I know that "ä" is
> represented by two bytes in UTF-8 encoding, I find this behaviour
> irritating because on the surface it's only one character. It even gets
> worse (IMHO) with characters that need three bytes in UTF-8 encoding,
> like LATIN CAPITAL LETTER A WITH DOT BELOW (0x1EA0), which increase the
> column number by three.
> 
> Also the "cursor()" function shows this kind of interpretation of
> non-ASCII characters. Both
> 
>   call cursor(2, 1)
> 
> and
> 
>   call cursor(2, 2)
> 
> place the cursor on "ä". To place it on "b" you need to
> 
>   call cursor(2, 3)
> 
> although I would expect that already the second example would place the
> cursor on "b".
> 
> I can think of two ways to circumvent this problem:
> 
>   1) switching to "encoding=latin1", which is not always an option
>  because of the need for characters outside the scope of latin1;
> 
>   2) using only virtual column numbers in the status line, but this
>  gives different results when characters like tab or non-printables
>  are displayed in more than one screen cell (which is of course
>  reasonable).
> 
> I don't know whether the shown behaviour is a bug or just a feature I
> don't like, but in summary I think "column number" should really
> represent a character count (i.e, corresponding to what the user sees),
> not a byte count depending on the underlying encoding.
> 
> I have seen this behaviour in VIM 6.2, 6.3, 6.4, and 7.0, so changing
> the code will definitely introduce an incompatibility. So the final
> question is: What do you (Vimmers) and you (Bram) think: is there a need
> for a change.

I don't know why you call this a column count, in most places it's
called a byte count.  Perhaps in some places in the docs the remark
about this actually being a byte count is missing.

You could also want a character count.  But what is a character when
using composing characters?  E.g., when the umlaut is not included in
a character but added as a separate composing character?

It's not so obvious what to do.  In these situations I rather keep it as
it is.
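
The composing-character case can be illustrated with Unicode normalization.
A short Python sketch using the standard `unicodedata` module; `counts` is
a hypothetical helper, and the example says nothing about how Vim itself
counts such sequences:

```python
import unicodedata

# U+00E4 LATIN SMALL LETTER A WITH DIAERESIS, one code point.
precomposed = "\u00e4"
# NFD splits it into "a" followed by U+0308 COMBINING DIAERESIS.
decomposed = unicodedata.normalize("NFD", precomposed)


def counts(s):
    """Return the code point count and the UTF-8 byte count of s."""
    return {"code_points": len(s), "utf8_bytes": len(s.encode("utf-8"))}


print(counts(precomposed))
print(counts(decomposed))
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

Both strings look the same on screen, yet the precomposed form is one code
point in two bytes and the decomposed form is two code points in three
bytes, so "character count" needs a definition before it can replace the
byte count.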

-- 
DENNIS: Look,  strange women lying on their backs in ponds handing out
swords ... that's no basis for a system of government.  Supreme
executive power derives from a mandate from the masses, not from some
farcical aquatic ceremony.
 "Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD

 /// Bram Moolenaar -- [EMAIL PROTECTED] -- http://www.Moolenaar.net   \\\
///        sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\        download, build and distribute -- http://www.A-A-P.org        ///
 \\\            help me help AIDS victims -- http://ICCF-Holland.org    ///


Re: Irritating column numbers with encoding=utf-8

2006-07-05 Thread Jürgen Krämer

Hi,

Bram Moolenaar wrote:
>
> Jürgen Krämer wrote:
>
>> with 'encoding' set to "utf-8" there is a quite confusing (to me)
>> difference between the column number and my expectations (supported by
>> the virtual column number) if there are non-ASCII characters on the
>> line. I don't know what the intended meaning of "column count" and the
>> intended behaviour of "cursor()" are, but it seems they both depend on
>> the size of the encoded characters. I always thought "nth column" was
>> more or less a synonym for "nth character on a line" while "nth virtual
>> column" meant "nth cell on a screen line".
>>
[snipped]
>>
>> I don't know whether the shown behaviour is a bug or just a feature I
>> don't like, but in summary I think "column number" should really
>> represent a character count (i.e, corresponding to what the user sees),
>> not a byte count depending on the underlying encoding.
>>
>> I have seen this behaviour in VIM 6.2, 6.3, 6.4, and 7.0, so changing
>> the code will definitely introduce an incompatibility. So the final
>> question is: What do you (Vimmers) and you (Bram) think: is there a need
>> for a change.
>
> I don't know why you call this a column count, in most places it's
> called a byte count.  Perhaps in some places in the docs the remark
> about this actually being a byte count is missing.

sorry, the "column count" in the first paragraph should have been a
"column number". I called it so because I have the statusline option set
to

  %<%f%= [%1*%M%*%{','.&fileformat}%R%Y] [%6l,%4c%V] %3b=0x%02B %P

and noticed that "%4c-%V" displayed two numbers instead of the one I
expected, because I knew there were no tabs or unprintable characters
on that line. Even more disturbing was the fact that the first number
(the column number) was bigger than the second one (the virtual column
number). So I checked ":help statusline" and it told me

c N   Column number.
v N   Virtual column number.
V N   Virtual column number as -{num}.  Not displayed if equal to 'c'.

> You could also want a character count.  But what is a character when
> using composing characters?  E.g., when the umlaut is not included in
> a character but added as a separate composing character?

I would say that a character is what the user sees. Why should he (be
forced to) know whether "ä" is represented internally as LATIN SMALL
LETTER A WITH DIAERESIS or as LATIN SMALL LETTER A plus COMBINING
DIAERESIS? So in my opinion "column count" is equivalent to "character
count" unless there are characters like tabs and unprintable ones that
have a special representation -- on the screen, not internally.

> It's not so obvious what to do.  In these situations I rather keep it as
> it is.

I know it's a big change and would introduce incompatibility with older
versions, but here is another example: Take this line (ignoring the
leading spaces)

  ääbbcc

and the following commands

  :s/\%3c../xx/
  :%s/^..\zs../xx/

From my point of view they should both replace the 3rd and 4th column
with "xx". When encoding is set to latin1 they do, but not when it is
set to utf-8 -- the first one replaces "äb" with "xx". As a user I would
be really stumped and ask "Why that, it's the same text as before."

Changing these commands to

  :s/\%2c../xx/
  :%s/^.\zs../xx/

makes things even more irritating. The second one works as expected, now
correctly replacing "äb" with "xx", but the first one fails with "E486:
Pattern not found: \%2c..". Again: Should I (as a user) really need to
know that \%2c depends on the number of non-ASCII letters in front of
the column I'm interested in?
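
Both outcomes follow from treating the column as a byte index that must
coincide with the first byte of a character. The Python sketch below models
that reading; it is an assumption based on the behaviour reported here, and
`sub_at_byte_col` is a hypothetical helper, not Vim's matching code:

```python
def sub_at_byte_col(line, col, nchars, repl, encoding="utf-8"):
    """Replace nchars characters starting at 1-based byte column col.

    Returns None (no match) when col is not the first byte of a character
    or when fewer than nchars characters follow.
    """
    start = 1
    for i, ch in enumerate(line):
        if start == col:
            if i + nchars > len(line):
                return None
            return line[:i] + repl + line[i + nchars:]
        start += len(ch.encode(encoding))
    return None


line = "\u00e4\u00e4bbcc"  # "ääbbcc"
print(sub_at_byte_col(line, 3, 2, "xx"))                      # utf-8, column 3
print(sub_at_byte_col(line, 2, 2, "xx"))                      # utf-8, column 2
print(sub_at_byte_col(line, 3, 2, "xx", encoding="latin-1"))  # latin1, column 3
```

With UTF-8, byte column 3 is the start of the second "ä", so "äb" is
replaced, and byte column 2 lies inside the first "ä", so nothing matches.
With latin1 every character is one byte and column 3 is the first "b".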

Regards,
Jürgen



RE: Irritating column numbers with encoding=utf-8

2006-07-05 Thread Zdenek Sekera
> -----Original Message-----
> From: Jürgen Krämer [mailto:[EMAIL PROTECTED] 
> Sent: 06 July 2006 08:01
> To: vim mailing list
> Subject: Re: Irritating column numbers with encoding=utf-8
> 
> 
> Hi,
> 
> Bram Moolenaar wrote:
> >
> > Jürgen Krämer wrote:
> >
> >> with 'encoding' set to "utf-8" there is a quite confusing (to me)
> >> difference between the column number and my expectations (supported by
> >> the virtual column number) if there are non-ASCII characters on the
> >> line. I don't know what the intended meaning of "column count" and the
> >> intended behaviour of "cursor()" are, but it seems they both depend on
> >> the size of the encoded characters. I always thought "nth column" was
> >> more or less a synonym for "nth character on a line" while "nth virtual
> >> column" meant "nth cell on a screen line".
> >>
> [snipped]
> >>
> >> I don't know whether the shown behaviour is a bug or just a feature I
> >> don't like, but in summary I think "column number" should really
> >> represent a character count (i.e, corresponding to what the user sees),
> >> not a byte count depending on the underlying encoding.
> >>
> >> I have seen this behaviour in VIM 6.2, 6.3, 6.4, and 7.0, so changing
> >> the code will definitely introduce an incompatibility. So the final
> >> question is: What do you (Vimmers) and you (Bram) think: is there a need
> >> for a change.
> >
> > I don't know why you call this a column count, in most places it's
> > called a byte count.  Perhaps in some places in the docs the remark
> > about this actually being a byte count is missing.
> 
> sorry, the "column count" in the first paragraph should have been a
> "column number". I called it so because I have the statusline option set
> to
> 
>   %<%f%= [%1*%M%*%{','.&fileformat}%R%Y] [%6l,%4c%V] %3b=0x%02B %P
> 
> and noticed that "%4c-%V" displayed two numbers instead of the one I
> expected, because I knew there were no tabs or unprintable characters
> on that line. Even more disturbing was the fact that the first number
> (the column number) was bigger than the second one (the virtual column
> number). So I checked ":help statusline" and it told me
> 
>   c N   Column number.
>   v N   Virtual column number.
>   V N   Virtual column number as -{num}.  Not displayed if equal to 'c'.
> 
> > You could also want a character count.  But what is a character when
> > using composing characters?  E.g., when the umlaut is not included in
> > a character but added as a separate composing character?
> 
> I would say that a character is what the user sees. Why should he (be
> forced to) know whether "ä" is represented internally as LATIN SMALL
> LETTER A WITH DIAERESIS or as LATIN SMALL LETTER A plus COMBINING
> DIAERESIS? So in my opinion "column count" is equivalent to "character
> count" unless there are characters like tabs and unprintable ones that
> have a special representation -- on the screen, not internally.
> 
> > It's not so obvious what to do.  In these situations I rather keep it as
> > it is.
> 
> I know it's a big change and would introduce incompatibility with older
> versions, but here is another example: Take this line (ignoring the
> leading spaces)
> 
>   ääbbcc
> 
> and the following commands
> 
>   :s/\%3c../xx/
>   :%s/^..\zs../xx/
> 
> From my point of view they should both replace the 3rd and 4th column
> with "xx". When encoding is set to latin1 they do, but not when it is
> set to utf-8 -- the first one replaces "äb" with "xx". As a user I would
> be really stumped and ask "Why that, it's the same text as before."
> 
> Changing these commands to
> 
>   :s/\%2c../xx/
>   :%s/^.\zs../xx/
> 
> makes things even more irritating. The second one works as expected, now
> correctly replacing "äb" with "xx", but the first one fails with "E486:
> Pattern not found: \%2c..". Again: Should I (as a user) really need to
> know that \%2c depends on the number of non-ASCII letters in front of
> the column I'm interested in?

Yes, this is indeed very unexpected IMHO and as you say
mighty irritating. I find it very hard to disagree with
your arguments. This should be changed IMHO, even if 
it surely is a big change.

---Zdenek