Re: grep replacement using sed is behaving oddly

Gary Dale Sat, 22 Oct 2022 06:23:35 -0700

On 2022-10-21 15:14, David Wright wrote:

On Fri 21 Oct 2022 at 14:15:01 (-0400), Greg Wooledge wrote:

On Fri, Oct 21, 2022 at 08:01:00PM +0200, [email protected] wrote:

On Fri, Oct 21, 2022 at 01:21:44PM -0400, Gary Dale wrote:

I'm hoping someone can tell me what I'm doing wrong. I have a line in a lot
of HTML files that I'd like to remove. The line is:


             <hr  style="border-top: 1px solid rgb(0, 32, 159); margin:
0rem;">

I'm testing the sed command to remove it on just one file. When it works,
I'll run it against *.html. My command is:

  sed -i -s 's/\s*\<hr\ \ style.*\>//g' history.html

Unfortunately, the replacement doesn't remove the line but rather leaves me
with:

             <;">

This looks as if the <> in the regexp were interpreted as left and right
word boundaries (but that would only be the case if you'd given the -E
(or -r) option).

Try explicitly adding the --posix option, perhaps...

Gary is using non-POSIX syntax (specifically the \s), so that's not going
to help unless he first changes his regular expression to be standard.
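A minimal sketch of what a standard version of the expression might look like, assuming the leading whitespace is made of ordinary spaces or tabs. The POSIX bracket class [[:space:]] stands in for \s, and the angle brackets are left unescaped so they match literally. The sample line and its trailing "tail" text are made up for illustration.

```shell
# Hypothetical sample line; "tail" is only there to show what survives.
sample='     <hr  style="margin: 0rem;">tail'

# [[:space:]] is the POSIX spelling of "whitespace"; < and > are literal in a BRE.
result=$(printf '%s\n' "$sample" | sed 's/^[[:space:]]*<hr  style.*>//')
printf '%s\n' "$result"
```

Because .* is greedy, the match runs to the last > on the line, so only the text after the tag is left.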

The whitespace is tricky. I pasted the email into emacs, and I see
that there are NO-BREAK SPACEs at the start, and one after "hr".
Who knows whether they're really in the OP's files, or just put
there by their MUA.
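One way to check whether a file really contains NO-BREAK SPACEs is to search for their raw bytes. This is only a sketch, assuming the files are UTF-8 encoded (where U+00A0 is the byte pair 0xC2 0xA0); the two sample lines are invented for the demonstration.

```shell
# U+00A0 NO-BREAK SPACE is the two bytes 0xC2 0xA0 in UTF-8.
nbsp=$(printf '\302\240')

# Hypothetical sample lines: one with ordinary spaces, one with NO-BREAK SPACEs.
plain=' <hr  style="margin: 0rem;">'
sneaky="${nbsp}<hr${nbsp} style=\"margin: 0rem;\">"

# LC_ALL=C makes grep compare raw bytes; -c prints the number of matching lines.
# "|| true" hides grep's non-zero exit status when nothing matches.
plain_hits=$(printf '%s\n' "$plain" | LC_ALL=C grep -c "$nbsp" || true)
sneaky_hits=$(printf '%s\n' "$sneaky" | LC_ALL=C grep -c "$nbsp" || true)
printf 'plain: %s, with NBSP: %s\n' "$plain_hits" "$sneaky_hits"
```

Run against a real file instead of the sample strings, a non-zero count would suggest the odd whitespace is in the file itself and not just an artifact of the mail client.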

I think you might be on to something with the \< and \> here.  I can see
absolutely no reason why Gary put backslashes in front of spaces and
angle brackets here.

I'm guessing the reason is guessing.

The backslashes in front of the spaces are probably
just noise, and can be ignored.  The \< and \> on the other hand might
be interpreted as something special, the same way \s is (because this is
GNU sed, which loves to do nonstandard things).

unicorn:~$ echo 'abc <foo> xyz' | sed 's/<.*>//'
abc  xyz
unicorn:~$ echo 'abc <foo> xyz' | sed 's/\<.*\>//'

unicorn:~$

So... yeah, \< and/or \> clearly have some special meaning to GNU sed.
Good luck figuring out what that is.

Word boundaries, as tomas said. The .*\> can be seen to have worked,
as matching stopped after the end of the word "rem", leaving the
punctuation behind.
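The effect can be reproduced on a cut-down version of Gary's line. This is a sketch assuming GNU sed and ordinary spaces in the input; the leading whitespace and the \s* are dropped to keep the focus on the anchors.

```shell
line='<hr  style="border-top: 1px solid rgb(0, 32, 159); margin: 0rem;">'

# Escaped: \< and \> are word-boundary anchors, so the match starts at "hr"
# and ends at the last end-of-word on the line, which is right after "rem".
escaped=$(printf '%s\n' "$line" | sed 's/\<hr  style.*\>//')

# Unescaped: < and > are literal characters, so the whole tag is matched.
literal=$(printf '%s\n' "$line" | sed 's/<hr  style.*>//')

printf 'escaped: [%s]\n' "$escaped"
printf 'literal: [%s]\n' "$literal"
```

The escaped form leaves the opening < and the trailing ;"> behind, which is the same debris Gary reported.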

For Gary's actual problem, simply removing the backslashes where they're
not wanted would be a good start.  Actually learning sed could be step 2.

The man/info pages leave a lot to be desired. A table with columns that showed:

   code    supported by    effect
      \s      -e           match all whitespace except NON-BREAK or whatever
              --posix
              -E
              --posix -E
              or whatever

might really help. As it is, unless you're looking at a real book,
you get a table like:

   '\s'
      Matches whitespace characters (spaces and tabs).  Newlines embedded
      in the pattern/hold spaces will also match:

   '\S'
      Matches non-whitespace characters.

   '\<'
      Matches the beginning of a word.

   '\>'
      Matches the end of a word.

but it's next to impossible to keep track of whether you're in a
section that's speaking POSIX, GNU, or some mid-20th century tradition.

I feel obliged at this point to mention that parsing HTML with regular
expressions is a fool's errand, and that sed should not be the tool of
choice here.  Nor should grep, nor any other RE-based tool.  This goes
triple when one doesn't even know the correct syntax for their RE.

https://stackoverflow.com/q/1732348

To be fair, I'm not sure whether the OP is really trying to parse
HTML, or just remove some similar strings that they see as redundant.

Cheers,
David.


Thanks. This command

    sed -i '/<hr  style.*>/d' *.html

did the trick.
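For anyone repeating this, a cautious way to try the command before pointing it at *.html might look like the sketch below. It assumes GNU sed (for the -i.bak backup suffix) and uses a throwaway file whose contents are made up for illustration.

```shell
# Work on a throwaway copy; the file contents here are made up for illustration.
tmp=$(mktemp -d)
printf '%s\n' \
  '<p>before</p>' \
  '             <hr  style="border-top: 1px solid rgb(0, 32, 159); margin: 0rem;">' \
  '<p>after</p>' > "$tmp/history.html"

# Preview: -n with the p command prints only the lines that d would delete.
preview=$(sed -n '/<hr  style.*>/p' "$tmp/history.html")
printf 'would delete: %s\n' "$preview"

# -i.bak edits in place and keeps the original as history.html.bak.
sed -i.bak '/<hr  style.*>/d' "$tmp/history.html"
remaining=$(cat "$tmp/history.html")
printf '%s\n' "$remaining"
```

The d command deletes the whole matching line, so no leftover whitespace or empty line remains, unlike an s///  substitution that only blanks out the match.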

I've gotten into the habit of escaping special characters rather than
memorizing the full list of which ones need to be escaped. I do most of
my editing in Kate but use sed from time to time when making the same
change to all the files in a web site, as was the case here. Obviously I
wasn't aware of the special meaning of \< and \> in sed...


Thanks again.
