Re: Unicode Support in sed?

Ingo Schwarze Thu, 11 Aug 2016 14:52:40 -0700

Hi Scott,

Scott Vanderbilt wrote on Thu, Aug 11, 2016 at 12:58:17PM -0700:


> I'm trying to use sed to munge some text in HTML files, converting
> Unicode characters to their HTML entity equivalents, however I can't
> seem to get it to work.
> 
> For instance, this command has no apparent effect:
> 
>   sed -i -e 's/\xe2\x80\x94/&mdash;/g' foo.html
> 
> Other sed operations using ASCII arguments work fine.
> 
> Does sed support Unicode in this fashion?

Our sed(1) does not have *explicit* UTF-8 support yet.
That means, /./ will not match a multibyte character, but /../ will
match a character if its UTF-8 representation is two bytes long,
or the last byte of one character together with the first of the
next.  [-] ranges will not work with UTF-8 characters, //i case
folding will not work, and so on and so forth...

However, you can still use sed(1) for your job by simply
treating UTF-8 characters as any ordinary byte string.

I suspect your problem is that the way you enter the multibyte
characters is incorrect, and the line shown above doesn't actually
contain UTF-8, but only ASCII: '\\', 'x', 'e' and so on.

Let me show you an example that does work:

   $ hexdump -C input.utf8
  00000000  3e c3 a4 3c 0a                      |>..<.|
  00000005
   $ hexdump -C script.sed           
  00000000  73 2f c3 a4 2f 61 65 2f  0a         |s/../ae/.|
  00000009
   $ sed -f script.sed input.utf8 | hexdump -C
  00000000  3e 61 65 3c 0a                      |>ae<.|
  00000005

Note how the U+00E4 = 0xc3a4 = LATIN SMALL LETTER A WITH DIAERESIS
gets replaced.

With that help, you ought to be able to get your task done.
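
Applied to the em dash from the original question, a minimal sketch might look like the following. The file names and the temporary directory are assumptions for illustration. printf(1) octal escapes are used to write the raw bytes e2 80 94 into the script file, and LC_ALL=C is set only to make the byte-wise treatment explicit. Note that a bare & in a sed replacement stands for the matched text, so it has to be written as \& to produce a literal ampersand.

```shell
# Work in a scratch directory so no real file is touched.
tmp=$(mktemp -d)

# Sample input: "a", U+2014 EM DASH (bytes e2 80 94), "b".
printf 'a\342\200\224b\n' > "$tmp/input.html"

# The sed script contains the three raw bytes, not the text "\xe2...".
# \\& in the printf format becomes \& in the script: a literal ampersand.
printf 's/\342\200\224/\\&mdash;/g\n' > "$tmp/mdash.sed"

result=$(LC_ALL=C sed -f "$tmp/mdash.sed" "$tmp/input.html")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Running hexdump -C on the generated script file is a quick way to confirm that it really contains e2 80 94 rather than backslash sequences.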

> The sed(1) man page is silent.

That's because nothing was done yet to make sed(1) aware of UTF-8.

> The FAQ section on Character Sets
> <http://www.openbsd.org/faq/faq10.html#locales> indicates that:
> 
>    OpenBSD uses the ASCII character set by default.

Uh oh.  Ah, hrm.  Well, kind of, but not really.

The LC_CTYPE locale defaults to "C", but that's required for any
POSIX-conforming operating system.  By default, ksh(1) emacs editing
mode partly supports UTF-8, even when LC_CTYPE is C, but ksh(1) vi
editing mode does not yet (i have a partial patch for that).  By
default, xterm(1) and pod2man(1) run in UTF-8 mode on OpenBSD, while
they default to strange hybrids of ASCII and ISO-LATIN-1 elsewhere.
man(1) always fully supports UTF-8 input, but avoids it for output
unless you set LC_CTYPE to SOMETHING.UTF-8 or pass it the -Tutf8
flag.  And so on for many programs...  Even to describe the default
for one single program, saying nothing but a single word "ASCII" or
"UTF-8" is usually insufficient, and different programs are very
different.

Talking about "the" default makes no sense, really.

> It also supports the Unicode (UTF-8) character set.

Ooops!  Do we really say that?  That's a bold claim...  :-o

In a way, it is true.  You can do many things with UTF-8
characters, and arguably, that wouldn't be possible if UTF-8
weren't supported, right?

Then again, it is not completely true.  There are still many tools
that do not fully support UTF-8, and some that don't at all.

> but I'm not sure what bearing that has on this issue.

You are exactly right!  That statement is so imprecise that it is
completely unclear what it is: more or less true, a bold lie, or a
sweeping generalization?

...

Now i'm starting to feel curious.  Let me read on:

  "The list of supported locales can be obtained by running the
   command:  locale -a"

YIKES!!  It looks like i urgently have to fix that part of the FAQ.
As it stands, it is spreading FAQ:  Fear, Ancertainty, and Quoubt.

Yours,
  Ingo
