On Friday 28 February 2003 09:03 pm, John Kilbourne wrote:
> I am a beginner as well, with the task of finding and counting the
> non-ascii characters in a utf-8 text. How do I do this?
That depends on what you want to accomplish.

Counting Unicode code points is easy. ASCII characters are single bytes of the form 0bbbbbbb in UTF-8. The initial byte of every non-ASCII character encoding has the form 11bbbbbb. All other bytes in UTF-8 streams (the continuation bytes) have the form 10bbbbbb. So counting the bytes in the range 11000000-11111111 (hex C0-FF) will suffice, since each non-ASCII character contributes exactly one such byte. In well-formed UTF-8 only C2-F4 actually occur in that range.

If you want to count text characters while ignoring control characters and undefined code points, or to count base character + modifier sequences as single characters, or to count glyphs in the rendering, you need to have a precise set of definitions suited to your application, and know a good deal about the details of Unicode.

--
Edward Cherlin
Generalist & activist--Linux, languages, literacy and more
"A knot! Oh, do let me help to undo it!"
--Alice in Wonderland
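
For the simple case of counting code points, here is a minimal sketch. Python is an assumption, since the question does not name a language, and the function names are made up for illustration. The first function counts the bytes whose top two bits are both set (11bbbbbb), which are the lead bytes of multi-byte sequences. It assumes the input is already well-formed UTF-8. The second function decodes first and so rejects malformed input instead of miscounting it.

```python
def count_non_ascii(data: bytes) -> int:
    """Count non-ASCII code points in well-formed UTF-8 bytes."""
    # Lead bytes of multi-byte sequences look like 11bbbbbb.
    # ASCII bytes (0bbbbbbb) and continuation bytes (10bbbbbb) are skipped.
    return sum(1 for b in data if (b & 0xC0) == 0xC0)


def count_non_ascii_decoded(data: bytes) -> int:
    """Same count via decoding; raises UnicodeDecodeError on bad input."""
    return sum(1 for ch in data.decode("utf-8") if ord(ch) > 0x7F)


sample = "naïve café, 10 €".encode("utf-8")
print(count_non_ascii(sample), count_non_ascii_decoded(sample))  # prints: 3 3
```

Both functions count code points only. A base character followed by a combining mark still counts as two, so anything closer to "characters as the reader sees them" needs the more careful definitions mentioned above.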