Re: grep replacement using sed is behaving oddly

David Wright Fri, 21 Oct 2022 12:15:19 -0700

On Fri 21 Oct 2022 at 14:15:01 (-0400), Greg Wooledge wrote:
> On Fri, Oct 21, 2022 at 08:01:00PM +0200, to...@tuxteam.de wrote:
> > On Fri, Oct 21, 2022 at 01:21:44PM -0400, Gary Dale wrote:
> > > I'm hoping someone can tell me what I'm doing wrong. I have a line in a lot
> > > of HTML files that I'd like to remove. The line is:
> > > 
> > >             <hr  style="border-top: 1px solid rgb(0, 32, 159); margin:
> > > 0rem;">
> > > 
> > > I'm testing the sed command to remove it on just one file. When it works,
> > > I'll run it against *.html. My command is:
> > > 
> > >  sed -i -s 's/\s*\<hr\ \ style.*\>//g' history.html
> > > 
> > > Unfortunately, the replacement doesn't remove the line but rather leaves me
> > > with:
> > > 
> > >             <;">
> > 
> > This looks as if the <> in the regexp were interpreted as left and right
> > word boundaries (but that would only be the case if you'd given the -E
> > (or -r) option).
> > 
> > Try explicitly adding the --posix option, perhaps...
> 
> Gary is using non-POSIX syntax (specifically the \s), so that's not going
> to help unless he first changes his regular expression to be standard.


The whitespace is tricky. I pasted the email into emacs, and I see
that there are NO-BREAK SPACEs at the start, and one after "hr".
Who knows whether they're really in the OP's files, or just put
there by their MUA.
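
One way to check a file for them, as a rough sketch: it assumes the
files are UTF-8 (where NO-BREAK SPACE is the bytes 0xC2 0xA0), and the
sample line is made up for illustration, not taken from the OP's files.

```shell
# U+00A0 NO-BREAK SPACE is the two bytes 0xC2 0xA0 in UTF-8 (octal 302 240).
nbsp=$(printf '\302\240')

# Hypothetical sample line: a NO-BREAK SPACE after "hr", then a normal space.
sample=$(printf '<hr\302\240 style="x">')

# Count the lines that contain the character.
count=$(printf '%s\n' "$sample" | grep -c "$nbsp")
echo "lines containing NO-BREAK SPACE: $count"

# Against a real file, something like:  grep -n "$nbsp" history.html
```

If the files were Latin-1 instead, the byte to look for would be the
single 0xA0 (octal 240).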

> I think you might be on to something with the \< and \> here.  I can see
> absolutely no reason why Gary put backslashes in front of spaces and
> angle brackets here.

I'm guessing the reason is guessing.

> The backslashes in front of the spaces are probably
> just noise, and can be ignored.  The \< and \> on the other hand might
> be interpreted as something special, the same way \s is (because this is
> GNU sed, which loves to do nonstandard things).
> 
> unicorn:~$ echo 'abc <foo> xyz' | sed 's/<.*>//'
> abc  xyz
> unicorn:~$ echo 'abc <foo> xyz' | sed 's/\<.*\>//'
> 
> unicorn:~$ 
> 
> So... yeah, \< and/or \> clearly have some special meaning to GNU sed.
> Good luck figuring out what that is.

Word boundaries, as tomas said. The .*\> can be seen to have worked,
as matching stopped after the end of the word "rem", leaving the
punctuation behind.
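
The effect is easy to reproduce. This is only a sketch: it assumes GNU
sed (\s, \< and \> are GNU extensions) and a sample line containing
ordinary ASCII spaces, which may not be what is in the OP's files.

```shell
# Assumes GNU sed and plain ASCII spaces in the sample line.
line='    <hr  style="border-top: 1px solid rgb(0, 32, 159); margin: 0rem;">'

# As posted: \< and \> are word boundaries, so the literal < and > are
# never matched. The match runs from "hr" to the end of the word "0rem".
broken=$(printf '%s\n' "$line" | sed 's/\s*\<hr\ \ style.*\>//g')

# Backslashes removed: < and > are now literal characters.
fixed=$(printf '%s\n' "$line" | sed 's/\s*<hr  style.*>//g')

printf '[%s]\n' "$broken" "$fixed"
```

Note that even the second command only empties the line; it doesn't
delete it. Deleting it would need the d command with an address
instead of s///.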

> For Gary's actual problem, simply removing the backslashes where they're
> not wanted would be a good start.  Actually learning sed could be step 2.

The man/info pages leave a lot to be desired. A table with columns that showed:

  code    supported by    effect
     \s      -e           match all whitespace except NO-BREAK SPACE or whatever
             --posix
             -E
             --posix -E
             or whatever

might really help. As it is, unless you're looking at a real book,
you get a list like:

  '\s'
     Matches whitespace characters (spaces and tabs).  Newlines embedded
     in the pattern/hold spaces will also match:

  '\S'
     Matches non-whitespace characters.

  '\<'
     Matches the beginning of a word.

  '\>'
     Matches the end of a word.

but it's next to impossible to keep track of whether you're in a
section that's speaking POSIX, GNU, or some mid-20th century tradition.

> I feel obliged at this point to mention that parsing HTML with regular
> expressions is a fool's errand, and that sed should not be the tool of
> choice here.  Nor should grep, nor any other RE-based tool.  This goes
> triple when one doesn't even know the correct syntax for their RE.
> 
> https://stackoverflow.com/q/1732348

To be fair, I'm not sure whether the OP is really trying to parse
HTML, or just remove some similar strings that they see as redundant.
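
If it's the latter, no regular expression is needed at all. A sketch
with grep's fixed-string matching, assuming the unwanted text sits on
one line and uses ordinary spaces (the sample data here is invented):

```shell
# Hypothetical sample; the real files may differ (e.g. NO-BREAK SPACEs).
sample='<p>before</p>
    <hr  style="border-top: 1px solid rgb(0, 32, 159); margin: 0rem;">
<p>after</p>'

# The string to look for, taken literally (-F), so no character is special.
unwanted='<hr  style="border-top: 1px solid rgb(0, 32, 159); margin: 0rem;">'

# -v keeps only the lines that do NOT contain the string.
kept=$(printf '%s\n' "$sample" | grep -vF -- "$unwanted")
printf '%s\n' "$kept"
```

Unlike sed -i this doesn't edit in place, so the output would have to
be written to a new file and moved over the original.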

Cheers,
David.
