On 2018-05-13 15:05, Philip Rowlands wrote:
In the slow case, wc is spending most of its time in iswprint / wcwidth / iswspace. Perhaps wc could learn a faster method of counting utf-8 (https://stackoverflow.com/a/7298149); this may be worthwhile as the trend to utf-8 everywhere marches on.

I can't explain without more digging why Python's string decode('utf-8') is better optimised for length calculations.
On the surface, it seems easy to explain: the Python program is just decoding UTF-8 and then taking the length. None of that requires character classification or determination of display width. If "wc -m" is doing something with display width, it is doing something very different from what the Python program is doing.

What are the requirements underpinning "wc -m", and how do iswprint and iswspace fit into them? POSIX says this:

"The -c option stands for "character" count, even though it counts bytes. This stems from the sometimes erroneous historical view that bytes and characters are the same size. Due to international requirements, the -m option (reminiscent of "multi-byte") was added to obtain actual character counts."

I don't see how this amounts to having to call iswspace and the rest. Nowhere does POSIX say that the display width of a character has to be obtained in "wc", and I don't see that in the GNU documentation either.
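To make the comparison concrete, here is a minimal sketch (not wc's actual code, and assuming well-formed UTF-8 input) of the byte-scanning approach the linked Stack Overflow answer describes: counting code points by skipping continuation bytes, with no per-character classification or width lookup at all.

/* Sketch: count UTF-8 code points on stdin without iswprint/wcwidth/iswspace.
 * Assumes valid UTF-8; continuation bytes have the bit pattern 10xxxxxx,
 * so every byte that does NOT match that pattern starts a new code point. */
#include <stdio.h>
#include <stddef.h>

static size_t count_utf8_codepoints(const unsigned char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if ((buf[i] & 0xC0) != 0x80)   /* not a continuation byte */
            count++;
    }
    return count;
}

int main(void)
{
    unsigned char buf[1 << 16];
    size_t total = 0, n;

    /* Code points split across buffer boundaries are still counted
     * correctly, since only the leading byte increments the count. */
    while ((n = fread(buf, 1, sizeof buf, stdin)) > 0)
        total += count_utf8_codepoints(buf, n);
    printf("%zu\n", total);
    return 0;
}

That is essentially the amount of work the Python len(data.decode('utf-8')) path has to do, which is why it comes out so much faster than a loop that classifies every character.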
