Jan Engelhardt wrote:
> 
> https://bugzilla.novell.com/show_bug.cgi?id=381873
> 
> Forwarding this because it is a GNU issue, not specifically a Novell one.
> I reproduced this myself with the latest coreutils from git
> (BTW: You might want to repack that repo, "counting objects" during the
> clone was rather slow in the initial counting.)
> 
> Could it be a libiconv problem?

Accounting for multibyte characters is what's taking the time:

~/git/coreutils/src$ time ./wc -m long_lines.txt
13357046 long_lines.txt
real    0m1.860s

~/git/coreutils/src$ time ./wc -c long_lines.txt
13538735 long_lines.txt
real    0m0.002s

Now that is a _lot_ of extra time. libiconv could probably be
made more efficient, though I've never actually looked at it.
However, wc calls mbrtowc() once for each multibyte character.
It would probably be a lot more efficient to use mbstowcs()
to convert the whole read buffer in one go.

Note that mbstowcs() stops at the first NUL, so one would
need to find any embedded NULs first and convert each substring
separately, as I did in my version of uniq previously mentioned.
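
A rough sketch of that idea, for illustration only (this is not the
code from wc or from my uniq; count_chars() is a made-up name). It
assumes the caller keeps one spare byte after the data so that
buf[len] can hold a terminating NUL, and it relies on the POSIX
behaviour of mbstowcs() with a null destination, which just returns
the number of wide chars needed. A real version would also have to
cope with invalid sequences and with a multibyte character split
across two reads, which this simply reports as an error.

```c
#include <stdlib.h>
#include <string.h>

/* Count the characters in buf[0..len-1] in the current locale,
   treating each embedded NUL byte as one character.
   Assumption: buf[len] == '\0' (caller reserves one extra byte).
   Returns -1 if a segment contains an invalid multibyte sequence.  */
long
count_chars (char const *buf, size_t len)
{
  long chars = 0;
  char const *p = buf;
  char const *end = buf + len;

  while (p < end)
    {
      /* Find the next embedded NUL, if any, in the remaining data.  */
      char const *nul = memchr (p, '\0', end - p);

      /* mbstowcs() stops at the first NUL, i.e. at the end of this
         segment.  With a null destination it only counts (POSIX).  */
      size_t n = mbstowcs (NULL, p, 0);
      if (n == (size_t) -1)
        return -1;
      chars += n;

      if (nul)
        {
          chars++;              /* the NUL itself is a character */
          p = nul + 1;
        }
      else
        p = end;
    }
  return chars;
}
```

So there is one library call per NUL-separated segment rather than one
per character. Whether that is actually faster would need measuring.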

Also, mbstowcs() doesn't canonicalize equivalent multibyte sequences,
so in this regard it behaves the same as our current
processing of each wide character separately.
This could actually be considered a bug, i.e. should -m give
the number of wide chars, or the number of multibyte chars?
With the attached patch, `wc -m` gives 23 chars (including the
newline) for both of these lines.

canonically équivalent
canonically équivalent

Pádraig.

p.s. I notice that gnome-terminal still doesn't handle
combining characters correctly, and my mail client, Thunderbird,
is putting the accent on the q rather than the e, sigh.

diff --git a/src/wc.c b/src/wc.c
index 61ab485..f7f7109 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -368,6 +368,8 @@ wc (int fd, char const *file_x, struct fstatus *fstatus)
 			    linepos += width;
 			  if (iswspace (wide_char))
 			    goto mb_word_separator;
+			  else if (width == 0)
+			    chars--; /* don't count combining chars */
 			  in_word = true;
 			}
 		      break;
_______________________________________________
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils
