tags 446894 patch thanks On Tue, Oct 16, 2007 at 10:09:51PM +0900, Kenshi Muto wrote: > I reported from a user who noticed less always crashed when he used > less as PAGER of man.
Thanks for reporting this. I had also been noticing this while testing man-db, but hadn't yet got round to isolating it and filing a bug. > After my quick looking, I noticed -Rm option of less caused this crash. > > Usually man calls less like this (I'm using LANG=ja_JP.UTF-8): > ---- > zcat < /usr/share/man/ja/man8/apt-get.8.gz| > /usr/bin/zsoelim|/usr/lib/man-db/manconv -f UTF-8:EUC-JP -t UTF-8|tbl|nroff > -mandoc -Tutf8 | less -ix8RmPm > ---- > > In this case, around line #183 of ja/apt-get.8 hits something. > > ---- > Program received signal SIGSEGV, Segmentation fault. > pr_expand (proto=0x0, maxwidth=78) at prompt.c:486 > 486 if (*proto == '\0') > (gdb) bt > #0 pr_expand (proto=0x0, maxwidth=78) at prompt.c:486 > #1 0x0000000000411548 in pr_string () at prompt.c:570 > #2 0x000000000040736c in commands () at command.c:653 > #3 0x0000000000402146 in main (argc=1, argv=0x7fff6e69d8a8) at main.c:286 > ---- > > 'less -ix8Pm' or other pagers (lv or more) didn't go a crash. I attacked this with valgrind. It turns out that the problem appears when less is trying to identify ANSI escape sequences by stepping back through the line character by character. Several of the macros used in less.h to test properties of characters cast them to unsigned char first for the benefit of ctype functions (some have preferred wctype alternatives, but not all); this is fine since that's what the ctype functions require, and the ASCII values being tested for in those cases without wctype alternatives (space and digits) cannot appear as the final byte of a multibyte UTF-8 sequence. However, there is one case in less.h where this naïve approach is incorrect. The IS_CSI_START macro casts its input to unsigned char before comparing it with CSI (== 0x9b). This *can* appear as the final byte of a multibyte UTF-8 sequence, and indeed does appear as such in your sample file. Since the bytes immediately after that are of course not valid elements of an ANSI escape sequence, store_char wants to remove the invalid sequence (lines 601-609). It does this by stepping back through the string character by character until it finds the start. Unfortunately it appears to miss the 0x9b byte it found earlier - I think because in_ansi_esc_seq and store_char have slightly different boundary conditions at the end of the string - and thus ends up stepping off the beginning of the string. Chaos ensues. So, the first thing to do is to change IS_CSI_START to cast to LWCHAR rather than unsigned char. Since LWCHAR is unsigned long, integer promotion still works the right way and comparison with CSI will work, but this prevents non-ASCII characters from looking like CSI. Even after fixing that, the CSI-removal code is still incorrect, because it steps back through the string byte by byte. The rest of the job involves changing this to step character by character instead. The attached patch less.446894.diff fixes these problems. After writing this, it occurred to me to look for newer upstream releases. less-418 actually does address similar bugs, and Kenshi's sample file works correctly with less-418. However, it does this by disabling the search for CSI in UTF-8 mode, and it still steps back through the string byte by byte, although it does now have a check to avoid stepping off the start. The former seems odd (why shouldn't CSI be allowed to work in UTF-8 mode? U+009B still means CSI according to the Unicode database), and if you fix that then you still have to fix the invalid sequence removal code. Stepping back character by character seems more elegant anyway. Mark, would you consider the attached less-418.csi.diff? Thanks, -- Colin Watson [EMAIL PROTECTED]
--- less-416.orig/less.h +++ less-416/less.h @@ -159,7 +159,7 @@ #define IS_DIGIT(c) ((c) >= '0' && (c) <= '9') #endif -#define IS_CSI_START(c) ((c) == ESC || ((unsigned char)(c)) == CSI) +#define IS_CSI_START(c) ((c) == ESC || ((LWCHAR)(c)) == CSI) #ifndef NULL #define NULL 0 --- less-416.orig/line.c +++ less-416/line.c @@ -600,9 +600,12 @@ { if (!is_ansi_end(ch) && !is_ansi_middle(ch)) { /* Remove whole unrecognized sequence. */ + char *p = &linebuf[curr]; + LWCHAR chansi; do { - --curr; - } while (!IS_CSI_START(linebuf[curr])); + chansi = step_char(&p, -1, linebuf); + } while (p > linebuf && !IS_CSI_START(chansi)); + curr -= &linebuf[curr] - p; return 0; } a = AT_ANSI; /* Will force re-AT_'ing around it. */
--- less-418.orig/less.h 2007-12-04 20:23:30.000000000 +0000 +++ less-418/less.h 2008-01-08 09:55:58.000000000 +0000 @@ -159,7 +159,7 @@ #define IS_DIGIT(c) ((c) >= '0' && (c) <= '9') #endif -#define IS_CSI_START(c) ((c) == ESC || (!utf_mode && ((unsigned char)(c)) == CSI)) +#define IS_CSI_START(c) ((c) == ESC || ((LWCHAR)(c)) == CSI) #ifndef NULL #define NULL 0 --- less-418.orig/line.c 2007-12-10 16:35:52.000000000 +0000 +++ less-418/line.c 2008-01-08 09:56:38.000000000 +0000 @@ -600,11 +600,12 @@ { if (!is_ansi_end(ch) && !is_ansi_middle(ch)) { /* Remove whole unrecognized sequence. */ + char *p = &linebuf[curr]; + LWCHAR chansi; do { - if (curr == 0) - break; - --curr; - } while (!IS_CSI_START(linebuf[curr])); + chansi = step_char(&p, -1, linebuf); + } while (p > linebuf && !IS_CSI_START(chansi)); + curr -= &linebuf[curr] - p; return 0; } a = AT_ANSI; /* Will force re-AT_'ing around it. */