On Thu, Aug 11, 2016 at 11:51:06PM +0200, Ingo Schwarze wrote:
> Hi Scott,
>
> Scott Vanderbilt wrote on Thu, Aug 11, 2016 at 12:58:17PM -0700:
>
> > I'm trying to use sed to munge some text in HTML files, converting
> > Unicode characters to their HTML entity equivalents, however I can't
> > seem to get it to work.
> >
> > For instance, this command has no apparent effect:
> >
> >   sed -i -e 's/\xe2\x80\x94/—/g' foo.html
> >
> > Other sed operations using ASCII arguments work fine.
> >
> > Does sed support Unicode in this fashion?
>
> Our sed(1) does not have *explicit* UTF-8 support yet.
> That means, /./ will not match a multibyte character, but /../ will
> match a character if its UTF-8 representation is two bytes long,
> or the last byte of one character together with the first of the
> next. [-] ranges will not work with UTF-8 characters, //i case
> folding will not work, and so on and so forth...
>
> However, you can still use sed(1) for your job by simply
> treating UTF-8 characters as any ordinary byte string.
>
> I suspect your problem is that the way you enter the multibyte
> characters is incorrect, and the line shown above doesn't actually
> contain UTF-8, but only ASCII: '\\', 'x', 'e' and so on.
>
> Let me show you an example that does work:
>
>    $ hexdump -C input.utf8
>   00000000  3e c3 a4 3c 0a                          |>..<.|
>   00000005
>    $ hexdump -C script.sed
>   00000000  73 2f c3 a4 2f 61 65 2f 0a              |s/../ae/.|
>   00000009
>    $ sed -f script.sed input.utf8 | hexdump -C
>   00000000  3e 61 65 3c 0a                          |>ae<.|
>   00000005
>
> Note how the U+00E4 = 0xc3a4 = LATIN SMALL LETTER A WITH DIAERESIS
> gets replaced.
>
> With that help, you ought to be able to get your task done.
>
> > The sed(1) man page is silent.
>
> That's because nothing was done yet to make sed(1) aware of UTF-8.
>
> > The FAQ section on Character Sets
> > <http://www.openbsd.org/faq/faq10.html#locales> indicates that:
> >
> >   OpenBSD uses the ASCII character set by default.
>
> Uh oh. Ah, hrm.
> Well, kind of, but not really.
>
> The LC_CTYPE locale defaults to "C", but that's required for any
> POSIX-conforming operating system. By default, ksh(1) emacs editing
> mode partly supports UTF-8, even when LC_CTYPE is C, but ksh(1) vi
> editing mode does not yet (i have a partial patch for that). By
> default, xterm(1) and pod2man(1) run in UTF-8 mode on OpenBSD, while
> they default to strange hybrids of ASCII and ISO-LATIN-1 elsewhere.
> man(1) always fully supports UTF-8 input, but avoids it for output
> unless you set LC_CTYPE to SOMETHING.UTF-8 or pass it the -Tutf8
> flag. And so on for many programs... Even to describe the default
> for one single program, saying nothing but a single word "ASCII" or
> "UTF-8" is usually insufficient, and different programs are very
> different.
>
> Talking about "the" default makes no sense, really.
>
> >   It also supports the Unicode (UTF-8) character set.
>
> Ooops! Do we really say that? That's a bold claim... :-o
>
> In a way, it is true. You can do many things with UTF-8
> characters, and arguably, that wouldn't be possible if UTF-8
> weren't supported, right?
>
> Then again, it is not completely true. There are still many tools
> that do not fully support UTF-8, and some that don't at all.
>
> > but I'm not sure what bearing that has on this issue.
>
> You are exactly right! That statement is so imprecise that it is
> completely unclear what it is: more or less true, a bold lie, or a
> sweeping generalization?
>
> ...
>
> Now i'm starting to feel curious. Let me read on:
>
>   "The list of supported locales can be obtained by running the
>   command: locale -a"
>
> YIKES!! It looks like i urgently have to fix that part of the FAQ.
> As it stands, it is spreading FAQ: Fear, Ancertainty, and Quoubt.
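
To make the byte-string approach quoted above concrete, here is a minimal sketch for the em dash case. It assumes the intended replacement is the `&mdash;` entity (the thread only says "HTML entity equivalents"), and it works on a throwaway copy in a temp directory rather than a real foo.html. The idea is to let printf(1) produce the three raw bytes e2 80 94 from octal escapes, so that sed(1) sees real UTF-8 bytes and not the ASCII characters '\', 'x', 'e', and so on.

```shell
# Hypothetical scratch directory and file, used only for this demonstration.
tmp=$(mktemp -d)

# Sample input: "a", U+2014 EM DASH as raw UTF-8 bytes (e2 80 94), "b".
printf 'a\342\200\224b\n' > "$tmp/foo.html"

# printf(1) understands octal escapes, so this variable holds three raw bytes.
emdash=$(printf '\342\200\224')

# In the C locale sed treats the pattern as an ordinary byte string.
# The ampersand is escaped because a bare & in the replacement would
# mean "the matched text". Assumption: &mdash; is the entity wanted.
LC_ALL=C sed "s/${emdash}/\\&mdash;/g" "$tmp/foo.html" > "$tmp/foo.out"

result=$(cat "$tmp/foo.out")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Writing to a separate output file sidesteps any question about in-place editing; move the result over the original once it looks right.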

In addition to Ingo's advice, you can also use GNU sed (pkg_add gsed)
or Perl.

-- 
Juan Francisco Cantero Hurtado http://juanfra.info