Re: RFC: wc --max-line-length vs. TABs [Re: Bug in wc

2008-08-23 Thread Jim Meyering
Bruno Haible [EMAIL PROTECTED] wrote:
> Hi Jim,
>
>> This behavior is not specified, and is currently untested.
>> (it's a GNU invention, from Bruno Haible in textutils-1.22d,
>> which was back in 1997)
>
> The intention of this option is and was to measure the maximum number of
> screen columns used by a file. For many purposes, people are encouraged
> to create/send/commit files with at most 80 screen columns. Or at most 79
> screen columns for others. Or at most 74 columns for GNU texinfo files.
> The option '-L' was intended as a fast check for this metric.
>
> The original mail, sent to bug-gnu-utils on 1997-10-31, had this explanation:
>
>    While GNU wc returns the vertical extent of a piece of text - i.e. the
>    number of lines - it does not yet return the horizontal extent of a piece
>    of text - i.e. the number of columns. This is a useful functionality, if
>    you want to know
>
>      - whether a text will fit on the paper when sent to the printer,
>      - whether an email exceeds the recommended 72 character limit,
>      - (in combination with nm) how long the identifiers were that made
>        `ranlib' dump core,
>      - etc.
>
> I propose a clarification in the documentation (see below).

Hi Bruno,

Thanks for the patch.  I've applied it.
Indeed, I've used it to check code for lines longer than 80.
Obviously, changing how TAB or multi-byte characters are counted
would break that.


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


RFC: wc --max-line-length vs. TABs [Re: Bug in wc

2008-08-22 Thread Jim Meyering
Jim Meyering [EMAIL PROTECTED] wrote:
> Arnaldo Mandel [EMAIL PROTECTED] wrote:
>> Dear maintainers,
>>
>> There is a bug in the implementation of the -L parameter in wc.
>> It is triggered by
>>
>> http://www.ime.usp.br/~am/122/eps/gapqm2.gz
>>
>> Check this out:
>>
>> $ zcat gapqm2.gz |wc -l -c -L
>>   1 6297954 6353180
>>
>> That is, the single line is longer than the whole file.
>>
>> This was pointed out by
>>
>>   William A. M. Gnann [EMAIL PROTECTED]
>
> Thanks for reporting it and for giving credit.
> FYI, here's a smaller reproducer:
>
>   $ printf '\t'|wc -L
>   8

This behavior is not specified, and is currently untested.
(it's a GNU invention, from Bruno Haible in textutils-1.22d,
which was back in 1997)

http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=ab5ff1597f5d734b711fbd95389cefcc8203d51c

I.e., the following change to make --max-line-length (-L)
never count a TAB as more than one byte does not induce
any test failure.

I'm tempted to make the change, but it seems too drastic, after 11 years.
Do any of you rely on the current TAB-counting behavior of GNU wc?

Bruno, what do you think?


diff --git a/src/wc.c b/src/wc.c
index 0bb1929..d44cf96 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -363,7 +363,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus)
       linepos = 0;
       goto mb_word_separator;
     case '\t':
-      linepos += 8 - (linepos % 8);
+      linepos++;
       goto mb_word_separator;
     case ' ':
       linepos++;
@@ -437,7 +437,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus)
       linepos = 0;
       goto word_separator;
     case '\t':
-      linepos += 8 - (linepos % 8);
+      linepos++;
       goto word_separator;
     case ' ':
       linepos++;
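
To make the effect of the two TAB variants concrete, here is a minimal
sketch of the single-byte counting loop. It is not the code from src/wc.c:
the helper name max_line_length and the expand_tabs flag are hypothetical,
and the sketch counts every other byte as one column, which is a
simplification.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not code from src/wc.c: returns the length of the
   longest line in buf[0..len).  With expand_tabs set, a TAB advances to
   the next multiple of 8 (the "-" lines of the diff); with it cleared,
   a TAB counts as 1 (the "+" lines of the diff).  */
static size_t
max_line_length (const char *buf, size_t len, bool expand_tabs)
{
  size_t linepos = 0;
  size_t longest = 0;

  for (size_t i = 0; i < len; i++)
    {
      switch (buf[i])
        {
        case '\n':
        case '\r':
        case '\f':
          /* End of a line: remember its length, restart at column 0.  */
          if (linepos > longest)
            longest = linepos;
          linepos = 0;
          break;
        case '\t':
          if (expand_tabs)
            linepos += 8 - (linepos % 8);
          else
            linepos++;
          break;
        default:
          /* Simplification: every other byte counts as one column.  */
          linepos++;
          break;
        }
    }

  /* Account for a final line that lacks a terminating newline.  */
  if (linepos > longest)
    longest = linepos;
  return longest;
}
```

With expand_tabs set, a lone TAB yields 8, matching the reproducer above,
and "ab\tc" yields 9; with it cleared, the same inputs yield 1 and 4.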




Re: RFC: wc --max-line-length vs. TABs [Re: Bug in wc

2008-08-22 Thread Bo Borgerson
Jim Meyering wrote:
>
> I'm tempted to make the change, but it seems too drastic, after 11 years.
> Do any of you rely on the current TAB-counting behavior of GNU wc?
>

Hi,

It looks like TAB characters aren't alone in being counted by printed
width rather than count:

$ echo '好' | wc -L
2

Does it make sense to change the behavior for TAB, but not for wide
characters?

Bo
diff --git a/src/wc.c b/src/wc.c
index 0bb1929..b3f1ab2 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -378,7 +378,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus)
 		{
 		  int width = wcwidth (wide_char);
 		  if (width > 0)
-			linepos += width;
+			linepos ++;
 		  if (iswspace (wide_char))
 			goto mb_word_separator;
 		  in_word = true;
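
The difference between counting by width and counting by character can be
sketched without touching wc itself. The names below (line_columns,
demo_width, unit_width) are hypothetical; demo_width is only a stand-in for
wcwidth so that the result does not depend on the current locale, and it
assumes wchar_t holds Unicode code points.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for wcwidth(): 2 columns for CJK Unified
   Ideographs (U+4E00..U+9FFF), 1 column for everything else.  */
static int
demo_width (wchar_t wc)
{
  return (wc >= 0x4E00 && wc <= 0x9FFF) ? 2 : 1;
}

/* Counts every character as 1, roughly what the patch above does.  */
static int
unit_width (wchar_t wc)
{
  (void) wc;
  return 1;
}

/* Sums the per-character widths of one line, skipping characters whose
   reported width is not positive, as the "if (width > 0)" test does.  */
static size_t
line_columns (const wchar_t *line, int (*width_of) (wchar_t))
{
  size_t linepos = 0;

  for (; *line != L'\0'; line++)
    {
      int width = width_of (*line);
      if (width > 0)
        linepos += width;
    }
  return linepos;
}
```

Under these assumptions, line_columns (L"\u597d", demo_width) gives 2 and
line_columns (L"\u597d", unit_width) gives 1. Passing wcwidth itself as
width_of would make the result depend on the locale in effect.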


Re: RFC: wc --max-line-length vs. TABs [Re: Bug in wc

2008-08-22 Thread Arnaldo Mandel
Bo Borgerson wrote (on Aug 22, 2008):
>
> Does it make sense to change the behavior for TAB, but not for wide
> characters?

Relying on an undocumented tab length seems bad.  However, on chars I
suggest you just apply the bug-feature operator: document that line
length is in chars, and explain that chars is a locale-dependent
concept.

Just my 2 cents.

am





Re: RFC: wc --max-line-length vs. TABs [Re: Bug in wc

2008-08-22 Thread Bruno Haible
Hi Jim,

> This behavior is not specified, and is currently untested.
> (it's a GNU invention, from Bruno Haible in textutils-1.22d,
> which was back in 1997)

The intention of this option is and was to measure the maximum number of
screen columns used by a file. For many purposes, people are encouraged
to create/send/commit files with at most 80 screen columns. Or at most 79
screen columns for others. Or at most 74 columns for GNU texinfo files.
The option '-L' was intended as a fast check for this metric.

The original mail, sent to bug-gnu-utils on 1997-10-31, had this explanation:

   While GNU wc returns the vertical extent of a piece of text - i.e. the
   number of lines - it does not yet return the horizontal extent of a piece
   of text - i.e. the number of columns. This is a useful functionality, if
   you want to know

 - whether a text will fit on the paper when sent to the printer,
 - whether an email exceeds the recommended 72 character limit,
 - (in combination with nm) how long the identifiers were that made
   `ranlib' dump core,
 - etc.

I propose a clarification in the documentation (see below).

> I'm tempted to make the change, but it seems too drastic, after 11 years.
> Do any of you rely on the current TAB-counting behavior of GNU wc?
>
> Bruno, what do you think?

If you change the option to count every tab as 1, or every character as 1
regardless of its screen width, the option -L is not usable for its main
purpose any more.

Bruno


2008-08-22  Bruno Haible  [EMAIL PROTECTED]

* doc/coreutils.texi (wc invocation): Explain what the option -L
measures.

--- coreutils.texi.bak  2008-08-22 23:55:47.0 +0200
+++ coreutils.texi  2008-08-22 23:59:03.0 +0200
@@ -3137,7 +3137,9 @@
 
 With the @option{--max-line-length} option, @command{wc} prints the length
 of the longest line per file, and if there is more than one file it
-prints the maximum (not the sum) of those lengths.
+prints the maximum (not the sum) of those lengths.  The line lengths here
+are measured in screen columns, according to the current locale and
+assuming tab positions in every 8th column.
 
 The program accepts the following options.  Also see @ref{Common options}.
 


