Bug#446894: less: -R option causes a crash by (at least) some Japanese contents

Colin Watson Tue, 08 Jan 2008 02:04:58 -0800

tags 446894 patch
thanks

On Tue, Oct 16, 2007 at 10:09:51PM +0900, Kenshi Muto wrote:
> I reported from a user who noticed less always crashed when he used
> less as PAGER of man.


Thanks for reporting this. I had also been noticing this while testing
man-db, but hadn't yet got round to isolating it and filing a bug.

> After my quick looking, I noticed -Rm option of less caused this crash.
> 
> Usually man calls less like this (I'm using LANG=ja_JP.UTF-8):
> ----
> zcat < /usr/share/man/ja/man8/apt-get.8.gz| 
> /usr/bin/zsoelim|/usr/lib/man-db/manconv -f UTF-8:EUC-JP -t UTF-8|tbl|nroff 
> -mandoc -Tutf8 | less -ix8RmPm
> ----
> 
> In this case, around line #183 of ja/apt-get.8 hits something.
> 
> ----
> Program received signal SIGSEGV, Segmentation fault.
> pr_expand (proto=0x0, maxwidth=78) at prompt.c:486
> 486             if (*proto == '\0')
> (gdb) bt
> #0  pr_expand (proto=0x0, maxwidth=78) at prompt.c:486
> #1  0x0000000000411548 in pr_string () at prompt.c:570
> #2  0x000000000040736c in commands () at command.c:653
> #3  0x0000000000402146 in main (argc=1, argv=0x7fff6e69d8a8) at main.c:286
> ----
> 
> 'less -ix8Pm' or other pagers (lv or more) didn't go a crash.

I attacked this with valgrind.

It turns out that the problem appears when less is trying to identify
ANSI escape sequences by stepping back through the line character by
character. Several of the macros used in less.h to test properties of
characters cast them to unsigned char first for the benefit of ctype
functions (some have preferred wctype alternatives, but not all); this
is fine since that's what the ctype functions require, and the ASCII
values being tested for in those cases without wctype alternatives
(space and digits) cannot appear as the final byte of a multibyte UTF-8
sequence.

However, there is one case in less.h where this naïve approach is
incorrect. The IS_CSI_START macro casts its input to unsigned char
before comparing it with CSI (== 0x9b). This *can* appear as the final
byte of a multibyte UTF-8 sequence, and indeed does appear as such in
your sample file. Since the bytes immediately after that are of course
not valid elements of an ANSI escape sequence, store_char wants to
remove the invalid sequence (lines 601-609). It does this by stepping
back through the string character by character until it finds the start.
Unfortunately it appears to miss the 0x9b byte it found earlier - I
think because in_ansi_esc_seq and store_char have slightly different
boundary conditions at the end of the string - and thus ends up stepping
off the beginning of the string. Chaos ensues.

So, the first thing to do is to change IS_CSI_START to cast to LWCHAR
rather than unsigned char. Since LWCHAR is unsigned long, integer
promotion still works the right way and comparison with CSI will work,
but this prevents non-ASCII characters from looking like CSI.

Even after fixing that, the CSI-removal code is still incorrect, because
it steps back through the string byte by byte. The rest of the job
involves changing this to step character by character instead.

The attached patch less.446894.diff fixes these problems.


After writing this, it occurred to me to look for newer upstream
releases. less-418 actually does address similar bugs, and Kenshi's
sample file works correctly with less-418. However, it does this by
disabling the search for CSI in UTF-8 mode, and it still steps back
through the string byte by byte, although it does now have a check to
avoid stepping off the start. The former seems odd (why shouldn't CSI be
allowed to work in UTF-8 mode? U+009B still means CSI according to the
Unicode database), and if you fix that then you still have to fix the
invalid sequence removal code. Stepping back character by character
seems more elegant anyway.

Mark, would you consider the attached less-418.csi.diff?

Thanks,

-- 
Colin Watson                                       [EMAIL PROTECTED]

--- less-416.orig/less.h
+++ less-416/less.h
@@ -159,7 +159,7 @@
 #define IS_DIGIT(c)	((c) >= '0' && (c) <= '9')
 #endif
 
-#define IS_CSI_START(c)	((c) == ESC || ((unsigned char)(c)) == CSI)
+#define IS_CSI_START(c)	((c) == ESC || ((LWCHAR)(c)) == CSI)
 
 #ifndef NULL
 #define	NULL	0
--- less-416.orig/line.c
+++ less-416/line.c
@@ -600,9 +600,12 @@
 	{
 		if (!is_ansi_end(ch) && !is_ansi_middle(ch)) {
 			/* Remove whole unrecognized sequence.  */
+			char *p = &linebuf[curr];
+			LWCHAR chansi;
 			do {
-				--curr;
-			} while (!IS_CSI_START(linebuf[curr]));
+				chansi = step_char(&p, -1, linebuf);
+			} while (p > linebuf && !IS_CSI_START(chansi));
+			curr -= &linebuf[curr] - p;
 			return 0;
 		}
 		a = AT_ANSI;	/* Will force re-AT_'ing around it.  */

--- less-418.orig/less.h	2007-12-04 20:23:30.000000000 +0000
+++ less-418/less.h	2008-01-08 09:55:58.000000000 +0000
@@ -159,7 +159,7 @@
 #define IS_DIGIT(c)	((c) >= '0' && (c) <= '9')
 #endif
 
-#define IS_CSI_START(c)	((c) == ESC || (!utf_mode && ((unsigned char)(c)) == CSI))
+#define IS_CSI_START(c)	((c) == ESC || ((LWCHAR)(c)) == CSI)
 
 #ifndef NULL
 #define	NULL	0
--- less-418.orig/line.c	2007-12-10 16:35:52.000000000 +0000
+++ less-418/line.c	2008-01-08 09:56:38.000000000 +0000
@@ -600,11 +600,12 @@
 	{
 		if (!is_ansi_end(ch) && !is_ansi_middle(ch)) {
 			/* Remove whole unrecognized sequence.  */
+			char *p = &linebuf[curr];
+			LWCHAR chansi;
 			do {
-				if (curr == 0)
-					break;
-				--curr;
-			} while (!IS_CSI_START(linebuf[curr]));
+				chansi = step_char(&p, -1, linebuf);
+			} while (p > linebuf && !IS_CSI_START(chansi));
+			curr -= &linebuf[curr] - p;
 			return 0;
 		}
 		a = AT_ANSI;	/* Will force re-AT_'ing around it.  */

Bug#446894: less: -R option causes a crash by (at least) some Japanese contents

Reply via email to