'wc -m' and combining characters

Nick Sun, 10 Mar 2024 08:39:28 -0700

I'm attempting to learn about UTF-8.  My question is about how wc
counts "combining characters", as discussed here
<https://www.cl.cam.ac.uk/~mgk25/unicode.html#comb>.


I made two files.  The first, p1.txt, contains "LATIN CAPITAL LETTER A
WITH DIAERESIS" (U+00C4).  The second, p2.txt, contains "LATIN CAPITAL
LETTER A" (U+0041) followed by "COMBINING DIAERESIS" (U+0308).  Neither
file contains a newline or any other bytes.

   $ od --format=x1 p1.txt
   0000000 c3 84
   0000002
   $ od --format=x1 p2.txt
   0000000 41 cc 88
   0000003
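As a cross-check on those bytes, here is a minimal Python sketch that
looks the characters up by the Unicode names given above and encodes
them as UTF-8.  It uses only the standard unicodedata module; the
variable names are my own, chosen for illustration.

```python
import unicodedata

# Look the characters up by their Unicode names.
precomposed = unicodedata.lookup("LATIN CAPITAL LETTER A WITH DIAERESIS")
decomposed = (unicodedata.lookup("LATIN CAPITAL LETTER A")
              + unicodedata.lookup("COMBINING DIAERESIS"))

# Encode each string as UTF-8, which is what the two files hold.
p1_bytes = precomposed.encode("utf-8")
p2_bytes = decomposed.encode("utf-8")

print(p1_bytes.hex(" "))  # c3 84
print(p2_bytes.hex(" "))  # 41 cc 88
```

The two lines printed match the od output for p1.txt and p2.txt
respectively.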

My question is: why does wc say that p2.txt contains two characters?

   $ wc -m -c p?.txt
   1 2 p1.txt
   2 3 p2.txt
   3 5 total

I'd naively expected the second line of output to start with 1,
i.e. to say that the file p2.txt has one character.  Markus Kuhn's FAQ
says "A combining character is not a full character by itself", but wc
seems to be saying that it is, no?
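To make the two ways of counting concrete, here is a minimal Python
sketch.  I am assuming, without having confirmed it, that the numbers
wc -m printed correspond to Unicode code points, which is what len()
returns for a decoded string.  The helper names count_code_points and
count_non_combining are hypothetical, and skipping characters with a
nonzero combining class is only a rough stand-in for counting
user-perceived characters, not a full grapheme-cluster algorithm.

```python
import unicodedata

def count_code_points(data: bytes) -> int:
    """Count Unicode code points in UTF-8 encoded data."""
    return len(data.decode("utf-8"))

def count_non_combining(data: bytes) -> int:
    """Count code points whose canonical combining class is 0.

    This is a rough approximation of "characters as a reader sees them".
    """
    text = data.decode("utf-8")
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

# The byte sequences shown by od for the two files.
p1 = b"\xc3\x84"      # A WITH DIAERESIS, precomposed
p2 = b"\x41\xcc\x88"  # A followed by COMBINING DIAERESIS

for name, data in (("p1.txt", p1), ("p2.txt", p2)):
    print(name, count_code_points(data), count_non_combining(data), len(data))

# Normalizing to NFC composes the pair into the single precomposed character.
same_after_nfc = unicodedata.normalize("NFC", p2.decode("utf-8")) == p1.decode("utf-8")
print(same_after_nfc)
```

Under those assumptions the code-point counts are 1 and 2, matching what
wc reported, while the non-combining counts are 1 and 1, which is what I
had expected to see.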

Sorry if this has already been done to death.  My search of the archives
failed to find a previous discussion, but perhaps I missed it.

Thanks
-- 
Nick
Asunción 12:04 PYST ►  37°C  ◆  nubes  ◆  3Km/h NE  ◆  52% HR
