Hi Scott,

Scott Vanderbilt wrote on Thu, Aug 11, 2016 at 12:58:17PM -0700:
> I'm trying to use sed to munge some text in HTML files, converting
> Unicode characters to their HTML entity equivalents, however I can't
> seem to get it to work.
>
> For instance, this command has no apparent effect:
>
>    sed -i -e 's/\xe2\x80\x94/—/g' foo.html
>
> Other sed operations using ASCII arguments work fine.
>
> Does sed support Unicode in this fashion?

Our sed(1) does not have *explicit* UTF-8 support yet.

That means, /./ will not match a multibyte character, but /../
will match a character if its UTF-8 representation is two bytes
long, or the last byte of one character together with the first
of the next.  [-] ranges will not work with UTF-8 characters,
//i case folding will not work, and so on and so forth...

However, you can still use sed(1) for your job by simply treating
UTF-8 characters as any ordinary byte string.  I suspect your problem
is that the way you enter the multibyte characters is incorrect,
and the line shown above doesn't actually contain UTF-8, but only
ASCII: '\\', 'x', 'e' and so on.

Let me show you an example that does work:

   $ hexdump -C input.utf8
  00000000  3e c3 a4 3c 0a                                    |>..<.|
  00000005
   $ hexdump -C script.sed
  00000000  73 2f c3 a4 2f 61 65 2f  0a                       |s/../ae/.|
  00000009
   $ sed -f script.sed input.utf8 | hexdump -C
  00000000  3e 61 65 3c 0a                                    |>ae<.|
  00000005

Note how the U+00E4 = 0xc3a4 = LATIN SMALL LETTER A WITH DIAERESIS
gets replaced.

With that help, you ought to be able to get your task done.

> The sed(1) man page is silent.

That's because nothing was done yet to make sed(1) aware of UTF-8.

> The FAQ section on Character Sets
> <http://www.openbsd.org/faq/faq10.html#locales> indicates that:
>
>   OpenBSD uses the ASCII character set by default.

Uh oh.  Ah, hrm.  Well, kind of, but not really.

The LC_CTYPE locale defaults to "C", but that's required for
any POSIX-conforming operating system.
By default, ksh(1) emacs editing mode partly supports UTF-8, even
when LC_CTYPE is C, but ksh(1) vi editing mode does not yet (i have
a partial patch for that).  By default, xterm(1) and pod2man(1) run
in UTF-8 mode on OpenBSD, while they default to strange hybrids of
ASCII and ISO-LATIN-1 elsewhere.  man(1) always fully supports UTF-8
input, but avoids it for output unless you set LC_CTYPE to
SOMETHING.UTF-8 or pass it the -Tutf8 flag.  And so on for many
programs...

Even to describe the default for one single program, saying nothing
but a single word "ASCII" or "UTF-8" is usually insufficient, and
different programs are very different.  Talking about "the" default
makes no sense, really.

> It also supports the Unicode (UTF-8) character set.

Ooops!  Do we really say that?  That's a bold claim...  :-o

In a way, it is true.  You can do many things with UTF-8 characters,
and arguably, that wouldn't be possible if UTF-8 weren't supported,
right?

Then again, it is not completely true.  There are still many tools
that do not fully support UTF-8, and some that don't at all.

> but I'm not sure what bearing that has on this issue.

You are exactly right!  That statement is so imprecise that it
is completely unclear what it is: more or less true, a bold lie,
or a sweeping generalization?

... Now i'm starting to feel curious.  Let me read on:

  "The list of supported locales can be obtained by running
   the command: locale -a"

YIKES!!

It looks like i urgently have to fix that part of the FAQ.
As it stands, it is spreading FAQ: Fear, Ancertainty, and Quoubt.

Yours,
  Ingo
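
P.S.  Applied to the em dash from your command, the byte-string
approach could look roughly like the sketch below.  It is only an
illustration under a few assumptions: that the entity you want is
&mdash;, that generating the script with printf(1) octal escapes is
acceptable (so the bytes e2 80 94 really end up in the file), and
that the file names from the example above are reused.  LC_ALL=C is
set only to make the byte-wise treatment explicit.

```shell
# Build the sed script with printf so that it contains the real
# UTF-8 bytes e2 80 94 (octal 342 200 224, U+2014 EM DASH) instead
# of the ASCII characters backslash, 'x', 'e', '2' and so on.
# The & is escaped because a bare & in a sed replacement stands
# for the matched text.
printf 's/\342\200\224/\\&mdash;/g\n' > script.sed

# A one-line test input with an em dash between two words.
printf 'foo\342\200\224bar\n' > input.utf8

# Treat the three bytes as an ordinary byte string.
LC_ALL=C sed -f script.sed input.utf8
```

Running hexdump -C on script.sed first, as in the example above, is
a cheap way to check that the pattern really contains the three
bytes before pointing the script at the real HTML files.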