Re: Unicode Support in sed?

2016-08-11 Thread Juan Francisco Cantero Hurtado
On Thu, Aug 11, 2016 at 11:51:06PM +0200, Ingo Schwarze wrote:
> Hi Scott,
> 
> Scott Vanderbilt wrote on Thu, Aug 11, 2016 at 12:58:17PM -0700:
> 
> > I'm trying to use sed to munge some text in HTML files, converting
> > Unicode characters to their HTML entity equivalents, however I can't
> > seem to get it to work.
> > 
> > For instance, this command has no apparent effect:
> > 
> >   sed -i -e 's/\xe2\x80\x94/&mdash;/g' foo.html
> > 
> > Other sed operations using ASCII arguments work fine.
> > 
> > Does sed support Unicode in this fashion?
> 
> Our sed(1) does not have *explicit* UTF-8 support yet.
> That means, /./ will not match a multibyte character, but /../ will
> match a character if its UTF-8 representation is two bytes long,
> or the last byte of one character together with the first of the
> next.  [-] ranges will not work with UTF-8 characters, //i case
> folding will not work, and so on and so forth...
> 
> However, you can still use sed(1) for your job by simply
> treating UTF-8 characters as any ordinary byte string.
> 
> I suspect your problem is that the way you enter the multibyte
> characters is incorrect, and the line shown above doesn't actually
> contain UTF-8, but only ASCII: '\\', 'x', 'e' and so on.
> 
> Let me show you an example that does work:
> 
>    $ hexdump -C input.utf8
>   00000000  3e c3 a4 3c 0a  |>..<.|
>   00000005
>    $ hexdump -C script.sed
>   00000000  73 2f c3 a4 2f 61 65 2f  0a  |s/../ae/.|
>   00000009
>    $ sed -f script.sed input.utf8 | hexdump -C
>   00000000  3e 61 65 3c 0a  |>ae<.|
>   00000005
> 
> Note how the U+00E4 = 0xc3a4 = LATIN SMALL LETTER A WITH DIAERESIS
> gets replaced.
> 
> With that help, you ought to be able to get your task done.
> 
> > The sed(1) man page is silent.
> 
> That's because nothing was done yet to make sed(1) aware of UTF-8.
> 
> > The FAQ section on Character Sets indicates that:
> > 
> >    OpenBSD uses the ASCII character set by default.
> 
> Uh oh.  Ah, hrm.  Well, kind of, but not really.
> 
> The LC_CTYPE locale defaults to "C", but that's required for any
> POSIX-conforming operating system.  By default, ksh(1) emacs editing
> mode partly supports UTF-8, even when LC_CTYPE is C, but ksh(1) vi
> editing mode does not yet (i have a partial patch for that).  By
> default, xterm(1) and pod2man(1) run in UTF-8 mode on OpenBSD, while
> they default to strange hybrids of ASCII and ISO-LATIN-1 elsewhere.
> man(1) always fully supports UTF-8 input, but avoids it for output
> unless you set LC_CTYPE to SOMETHING.UTF-8 or pass it the -Tutf8
> flag.  And so on for many programs...  Even to describe the default
> for one single program, saying nothing but a single word "ASCII" or
> "UTF-8" is usually insufficient, and different programs are very
> different.
> 
> Talking about "the" default makes no sense, really.
> 
> > It also supports the Unicode (UTF-8) character set.
> 
> Ooops!  Do we really say that?  That's a bold claim...  :-o
> 
> In a way, it is true.  You can do many things with UTF-8
> characters, and arguably, that wouldn't be possible if UTF-8
> weren't supported, right?
> 
> Then again, it is not completely true.  There are still many tools
> that do not fully support UTF-8, and some that don't at all.
> 
> > but I'm not sure what bearing that has on this issue.
> 
> You are exactly right!  That statement is so imprecise that it is
> completely unclear what it is: more or less true, a bold lie, or a
> sweeping generalization?
> 
> ...
> 
> Now i'm starting to feel curious.  Let me read on:
> 
>   "The list of supported locales can be obtained by running the
>    command:  locale -a"
> 
> YIKES!!  It looks like i urgently have to fix that part of the FAQ.
> As it stands, it is spreading FAQ:  Fear, Ancertainty, and Quoubt.

In addition to Ingo's advice, you can also use GNU sed (pkg_add gsed) or
perl.

-- 
Juan Francisco Cantero Hurtado http://juanfra.info



Re: Unicode Support in sed?

2016-08-11 Thread Ingo Schwarze
Hi,

Ingo Schwarze wrote on Thu, Aug 11, 2016 at 11:51:06PM +0200:
> Scott Vanderbilt wrote on Thu, Aug 11, 2016 at 12:58:17PM -0700:

>> The FAQ section on Character Sets indicates that:
[...]
> YIKES!!
> It looks like i urgently have to fix that part of the FAQ.

Done, i rewrote most of it, and it is online now.

There is still much room for adding useful information, but as a
first step, at least all the bad stuff is gone now.

It still doesn't answer your question about sed(1), but even when
coming from that question, the text should be much less misleading
and confusing now.

Yours,
  Ingo



Re: Unicode Support in sed?

2016-08-11 Thread Ingo Schwarze
Hi Scott,

Scott Vanderbilt wrote on Thu, Aug 11, 2016 at 12:58:17PM -0700:

> I'm trying to use sed to munge some text in HTML files, converting
> Unicode characters to their HTML entity equivalents, however I can't
> seem to get it to work.
> 
> For instance, this command has no apparent effect:
> 
>   sed -i -e 's/\xe2\x80\x94/&mdash;/g' foo.html
> 
> Other sed operations using ASCII arguments work fine.
> 
> Does sed support Unicode in this fashion?

Our sed(1) does not have *explicit* UTF-8 support yet.
That means, /./ will not match a multibyte character, but /../ will
match a character if its UTF-8 representation is two bytes long,
or the last byte of one character together with the first of the
next.  [-] ranges will not work with UTF-8 characters, //i case
folding will not work, and so on and so forth...
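
The byte-wise matching described above can be observed with a small
sketch like this one.  It forces LC_ALL=C so that other sed
implementations also treat the input as bytes; that setting and the
function name replace_each_dot are assumptions added for illustration.

```shell
# Hypothetical helper: replace whatever "." matches with an X.
# LC_ALL=C is an assumption to get byte semantics on any sed.
replace_each_dot() {
    LC_ALL=C sed 's/./X/g'
}

# U+00E4 is the two bytes c3 a4 in UTF-8 (octal 303 244).
# Each byte is matched separately, so this prints XX, not X.
printf '\303\244\n' | replace_each_dot
```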

However, you can still use sed(1) for your job by simply
treating UTF-8 characters as any ordinary byte string.

I suspect your problem is that the way you enter the multibyte
characters is incorrect, and the line shown above doesn't actually
contain UTF-8, but only ASCII: '\\', 'x', 'e' and so on.

Let me show you an example that does work:

   $ hexdump -C input.utf8
  00000000  3e c3 a4 3c 0a  |>..<.|
  00000005
   $ hexdump -C script.sed
  00000000  73 2f c3 a4 2f 61 65 2f  0a  |s/../ae/.|
  00000009
   $ sed -f script.sed input.utf8 | hexdump -C
  00000000  3e 61 65 3c 0a  |>ae<.|
  00000005

Note how the U+00E4 = 0xc3a4 = LATIN SMALL LETTER A WITH DIAERESIS
gets replaced.

With that help, you ought to be able to get your task done.
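
Applied to the original task, one way to get the real bytes into the
sed command is to let printf(1) produce them.  This is only a sketch:
the function name emdash_to_entity is made up, LC_ALL=C is an
assumption to keep other sed implementations byte-oriented, and octal
escapes are used because POSIX printf(1) guarantees \ooo but not \xHH.

```shell
# Hypothetical helper: replace the UTF-8 bytes of U+2014 with the entity.
emdash_to_entity() {
    # e2 80 94 written as octal escapes; the variable then holds the
    # three raw bytes rather than the ASCII characters '\', 'x', 'e'...
    emdash=$(printf '\342\200\224')
    # "&" in a sed replacement means "the matched text", so it has to
    # be escaped to get a literal ampersand.
    LC_ALL=C sed "s/${emdash}/\\&mdash;/g"
}

printf 'a\342\200\224b\n' | emdash_to_entity
```

The same substitution can also be stored in a script file and run with
sed -f, as in the example above.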

> The sed(1) man page is silent.

That's because nothing was done yet to make sed(1) aware of UTF-8.

> The FAQ section on Character Sets indicates that:
> 
>    OpenBSD uses the ASCII character set by default.

Uh oh.  Ah, hrm.  Well, kind of, but not really.

The LC_CTYPE locale defaults to "C", but that's required for any
POSIX-conforming operating system.  By default, ksh(1) emacs editing
mode partly supports UTF-8, even when LC_CTYPE is C, but ksh(1) vi
editing mode does not yet (i have a partial patch for that).  By
default, xterm(1) and pod2man(1) run in UTF-8 mode on OpenBSD, while
they default to strange hybrids of ASCII and ISO-LATIN-1 elsewhere.
man(1) always fully supports UTF-8 input, but avoids it for output
unless you set LC_CTYPE to SOMETHING.UTF-8 or pass it the -Tutf8
flag.  And so on for many programs...  Even to describe the default
for one single program, saying nothing but a single word "ASCII" or
"UTF-8" is usually insufficient, and different programs are very
different.

Talking about "the" default makes no sense, really.

> It also supports the Unicode (UTF-8) character set.

Ooops!  Do we really say that?  That's a bold claim...  :-o

In a way, it is true.  You can do many things with UTF-8
characters, and arguably, that wouldn't be possible if UTF-8
weren't supported, right?

Then again, it is not completely true.  There are still many tools
that do not fully support UTF-8, and some that don't at all.

> but I'm not sure what bearing that has on this issue.

You are exactly right!  That statement is so imprecise that it is
completely unclear what it is: more or less true, a bold lie, or a
sweeping generalization?

...

Now i'm starting to feel curious.  Let me read on:

  "The list of supported locales can be obtained by running the
   command:  locale -a"

YIKES!!  It looks like i urgently have to fix that part of the FAQ.
As it stands, it is spreading FAQ:  Fear, Ancertainty, and Quoubt.

Yours,
  Ingo



Unicode Support in sed?

2016-08-11 Thread Scott Vanderbilt
I'm trying to use sed to munge some text in HTML files, converting 
Unicode characters to their HTML entity equivalents, however I can't 
seem to get it to work.


For instance, this command has no apparent effect:

  sed -i -e 's/\xe2\x80\x94/&mdash;/g' foo.html

Other sed operations using ASCII arguments work fine.

Does sed support Unicode in this fashion? The sed(1) man page is silent.
The FAQ section on Character Sets indicates that:


   OpenBSD uses the ASCII character set by default. It also supports
   the Unicode (UTF-8) character set.

but I'm not sure what bearing that has on this issue.

Running OpenBSD 6.0 (GENERIC.MP) #2302: Sat Jul 23 09:33:37 MDT 2016 (amd64)

Many thanks in advance for any assistance.