Hello,

On Sun, May 13, 2018 at 11:05:11PM +0100, Philip Rowlands wrote:
> On Sun, 13 May 2018, at 02:55, Peng Yu wrote:
> > The following example shows that `wc -m` is even slower than the
> > equivalent Python code. Can this performance bug be fixed?
>
> I can reproduce the slow wc behaviour with UTF-8 enabled locales.

As this thread expands, it is important to be as precise as possible
about what is observed, and in which environments.

> $ seq 1000000 | time -p wc -c
> $ seq 1000000 | time -p wc -m
> $ seq 1000000 | LANG=C time -p wc -m
[...]
> In the slow case, wc is spending most of its time in iswprint / wcwidth /
> iswspace.

So far we have observed the following when using GNU coreutils' wc:

1. Running "wc -m" in a multibyte locale will always be slower than
   "wc -c".

2. Running "wc -c" should take more or less the same time as
   "LC_ALL=C wc -m".

3. Under GNU/Linux in a multibyte locale, "wc -m" is faster than the
   attached python script (wcm.py).

What Peng Yu reported is that on Mac OS X with a multibyte locale the
python script is faster than GNU's "wc -m".

I currently do not have access to a Mac OS X machine.
Testing on FreeBSD (which should be similar enough) I still cannot
reproduce this issue (i.e. I find GNU's "wc" is faster than "wcm.py" in
all circumstances).

Phil,
When you write "slow", do you mean that "wc -m" was slower than running
a python script, or slower than "wc -c"?
If the python script, can you provide more information about your
environment (OS, python version, wc --version, locale)?

> Perhaps wc could learn a faster method of counting utf-8
> (https://stackoverflow.com/a/7298149); this may be worthwhile as the trend to
> utf-8 everywhere marches on.

There are many UTF-8-specific optimizations, and gnulib has many of them
implemented. But using the POSIX standard multibyte functions
(e.g. iswprint/wcwidth) ensures 'wc' works not only in UTF-8 but in all
multibyte locales.

There is always a possibility of adding yet more code for UTF-8-specific
inputs - there are pros and cons to that approach.

regards,
 - assaf
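
P.S. To make the "UTF-8-specific" idea concrete, here is a minimal sketch
of one common shortcut: count every byte that is not a continuation byte
(10xxxxxx). It is an illustration only, not the code from wcm.py, from
gnulib, or necessarily from the linked answer. The function name
count_utf8_chars is made up for this example, and it assumes the input is
valid UTF-8.

```python
def count_utf8_chars(data: bytes) -> int:
    """Count code points in UTF-8 data (hypothetical helper, for illustration).

    Assumes `data` is valid UTF-8. Every code point starts with exactly one
    byte that is NOT of the form 10xxxxxx, so counting those bytes counts
    the code points without decoding anything.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


# ASCII only: every byte is a lead byte.
ascii_count = count_utf8_chars(b"hello\n")

# Mixed input: 'é' is 2 bytes and '€' is 3 bytes, but each is one character.
mixed = "héllo €\n".encode("utf-8")
mixed_count = count_utf8_chars(mixed)
```

This also shows the trade-off mentioned above: the shortcut is only
meaningful for UTF-8, while the standard multibyte functions work for
whatever encoding the current locale uses.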
