I'm attempting to learn about UTF-8. My question is about how wc counts "combining characters", as discussed here <https://www.cl.cam.ac.uk/~mgk25/unicode.html#comb>.
I made two files: p1.txt, containing "LATIN CAPITAL LETTER A WITH DIAERESIS", and p2.txt, containing "LATIN CAPITAL LETTER A" followed by "COMBINING DIAERESIS". Neither file contains a newline or any other bytes.

    $ od --format=x1 p1.txt
    0000000 c3 84
    0000002
    $ od --format=x1 p2.txt
    0000000 41 cc 88
    0000003

My question is: why does wc say that p2.txt contains two characters?

    $ wc -m -c p?.txt
    1 2 p1.txt
    2 3 p2.txt
    3 5 total

I'd naively expected the second line of output to start with 1, i.e. to say that p2.txt has one character. Markus Kuhn's FAQ says "A combining character is not a full character by itself", but wc seems to be saying that it is, no?

Sorry if this has already been done to death. I searched the archives and failed to find a previous discussion, but perhaps I missed it.

Thanks
-- 
Nick Asunción
12:04 PYST ► 37°C ◆ clouds ◆ 3 km/h NE ◆ 52% RH
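P.S. To check that I'm reading the bytes correctly, I put together a small Python sketch that decodes the same byte sequences od shows above. The describe() helper is my own name, not anything from wc, and the third number (code points that are not combining marks) is only my guess at what "one character" would mean for p2.txt. The first two numbers match what wc -c and wc -m printed, which makes me suspect wc -m is counting code points, but that is an assumption on my part.

```python
import unicodedata

# Byte contents of the two files, copied from the od output above.
p1 = bytes([0xC3, 0x84])        # LATIN CAPITAL LETTER A WITH DIAERESIS
p2 = bytes([0x41, 0xCC, 0x88])  # LATIN CAPITAL LETTER A + COMBINING DIAERESIS


def describe(data):
    """Return (bytes, code points, code points that are not combining marks).

    Hypothetical helper for illustration only; it does not reproduce
    how wc is actually implemented.
    """
    text = data.decode("utf-8")
    # unicodedata.combining() returns 0 for characters that are not combining marks.
    non_combining = sum(1 for ch in text if unicodedata.combining(ch) == 0)
    return len(data), len(text), non_combining


print("p1.txt", describe(p1))
print("p2.txt", describe(p2))

# The two files hold the same text once normalized to NFC.
print(unicodedata.normalize("NFC", p2.decode("utf-8")) == p1.decode("utf-8"))
```

If I skip the combining mark, p2.txt comes out as one "character", which is the number I had expected from wc.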