Jan Engelhardt wrote:
> > https://bugzilla.novell.com/show_bug.cgi?id=381873
>
> Forwarding this because it is a GNU issue, not specifically a Novell one.
> I reproduced this myself with the latest coreutils from git
> (BTW: You might want to repack that repo, "counting objects" during the
> clone was rather slow in the initial counting.)
>
> Could it be a libiconv problem?

Accounting for multibyte characters is what's taking the time:

  ~/git/coreutils/src$ time ./wc -m long_lines.txt
  13357046 long_lines.txt
  real    0m1.860s

  ~/git/coreutils/src$ time ./wc -c long_lines.txt
  13538735 long_lines.txt
  real    0m0.002s

Now that is a _lot_ of extra time.
libiconv could probably be made more efficient, though
I've never actually looked at it.

However, wc calls mbrtowc() for each multibyte character.
It would probably be a lot more efficient to use mbstowcs()
to convert the whole read buffer. Note that mbstowcs() doesn't
handle embedded NULs, so one would need to find these first
and iterate over each substring, as I did in my version of
uniq previously mentioned.

Also, mbstowcs() doesn't canonicalize equivalent multibyte
sequences, so in this regard it behaves the same as our
processing of each wide character separately.
This could actually be considered a bug, i.e. should -m give
the number of wide chars, or the number of multibyte chars?
With the attached patch, `wc -m` gives 23 chars for both these lines:

  canonically équivalent
  canonically équivalent

Pádraig.

p.s. I notice that gnome-terminal still doesn't handle
combining characters correctly, and my mail client, Thunderbird,
is putting the accent on the q rather than the e, sigh.

diff --git a/src/wc.c b/src/wc.c
index 61ab485..f7f7109 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -368,6 +368,8 @@ wc (int fd, char const *file_x, struct fstatus *fstatus)
                         linepos += width;
                       if (iswspace (wide_char))
                         goto mb_word_separator;
+                      else if (width == 0)
+                        chars--; /* don't count combining chars */
                       in_word = true;
                     }
                   break;
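
For reference, the mbstowcs() idea from the message could look roughly like the sketch below. It is not part of the patch and not what wc does today. The helper name count_chars_mbstowcs() is hypothetical, and it assumes the caller keeps a NUL sentinel at buf[len] and that the buffer does not end in the middle of a multibyte sequence (a real read loop would have to carry such a tail over to the next read). It also relies on the POSIX behaviour that mbstowcs() with a NULL destination only counts the characters.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
   Count the characters in buf[0..len) under the current locale.
   Assumes buf[len] is a NUL sentinel owned by the caller.
   Returns (size_t) -1 if an invalid multibyte sequence is found.  */
static size_t
count_chars_mbstowcs (const char *buf, size_t len)
{
  size_t chars = 0;
  const char *p = buf;
  const char *end = buf + len;

  while (p < end)
    {
      /* mbstowcs() stops at the first NUL, so split the buffer there.  */
      const char *nul = memchr (p, '\0', (size_t) (end - p));
      const char *stop = nul ? nul : end;   /* *stop is NUL either way */

      if (stop > p)
        {
          /* NULL destination: count only, convert nothing (POSIX).  */
          size_t n = mbstowcs (NULL, p, 0);
          if (n == (size_t) -1)
            return (size_t) -1;
          chars += n;
        }

      if (nul)
        chars++;                /* an embedded NUL is itself one char */

      p = stop + 1;             /* skip past the NUL or the sentinel */
    }

  return chars;
}
```

Like the per-character mbrtowc() loop, this counts wide characters, so a base letter followed by a combining accent still counts as two unless zero-width characters are filtered out separately.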

_______________________________________________
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils