On Fri 21 Oct 2022 at 14:15:01 (-0400), Greg Wooledge wrote:
> On Fri, Oct 21, 2022 at 08:01:00PM +0200, [email protected] wrote:
> > On Fri, Oct 21, 2022 at 01:21:44PM -0400, Gary Dale wrote:
> > > I'm hoping someone can tell me what I'm doing wrong. I have a line in a lot
> > > of HTML files that I'd like to remove. The line is:
> > >
> > > <hr style="border-top: 1px solid rgb(0, 32, 159); margin: 0rem;">
> > >
> > > I'm testing the sed command to remove it on just one file. When it works,
> > > I'll run it against *.html. My command is:
> > >
> > > sed -i -s 's/\s*\<hr\ \ style.*\>//g' history.html
> > >
> > > Unfortunately, the replacement doesn't remove the line but rather leaves
> > > me with:
> > >
> > > <;">
> >
> > This looks as if the <> in the regexp were interpreted as left and right
> > word boundaries (but that would only be the case if you'd given the -E
> > (or -r) option).
> >
> > Try explicitly adding the --posix option, perhaps...
>
> Gary is using non-POSIX syntax (specifically the \s), so that's not going
> to help unless he first changes his regular expression to be standard.

The whitespace is tricky. I pasted the email into emacs, and I see
that there are NO-BREAK SPACEs at the start, and one after "hr".
Who knows whether they're really in the OP's files, or just put
there by their MUA.
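One way to find out, rather than guess, is to search the files for the raw bytes. This is only a sketch: it assumes the files are UTF-8 (where U+00A0 is the byte pair C2 A0), and has_nbsp is a made-up helper name, not anything from the thread.

```shell
# Sketch only: has_nbsp is a hypothetical helper, and UTF-8 encoding is assumed.
# U+00A0 NO-BREAK SPACE is encoded in UTF-8 as the two bytes 0xC2 0xA0.
nbsp=$(printf '\302\240')

has_nbsp() {
    # LC_ALL=C makes grep compare raw bytes; -F takes the pattern literally;
    # -c prints the number of matching lines (0 if there are none).
    LC_ALL=C grep -c -F -e "$nbsp" "$1"
}
```

Something like "has_nbsp history.html" would then print how many lines of that file contain one. A file in a single-byte encoding such as ISO-8859-1 would need a different byte pattern.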
> I think you might be on to something with the \< and \> here. I can see
> absolutely no reason why Gary put backslashes in front of spaces and
> angle brackets here.

I'm guessing the reason is guessing.

> The backslashes in front of the spaces are probably
> just noise, and can be ignored. The \< and \> on the other hand might
> be interpreted as something special, the same way \s is (because this is
> GNU sed, which loves to do nonstandard things).
>
> unicorn:~$ echo 'abc <foo> xyz' | sed 's/<.*>//'
> abc  xyz
> unicorn:~$ echo 'abc <foo> xyz' | sed 's/\<.*\>//'
>
> unicorn:~$
>
> So... yeah, \< and/or \> clearly have some special meaning to GNU sed.
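> Good luck figuring out what that is.

Word boundaries, as tomas said: \< matches at the start of a word and
\> at the end of one. Both can be seen to have worked: \<hr anchored
on the "h", leaving the "<" in front of it, and the greedy .*\> ran to
the end of the last word, "0rem", leaving the trailing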
punctuation behind.
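The effect can be reproduced on a copy of the OP's line. This is a sketch that assumes GNU sed and ordinary spaces in the input. The pattern is a simplified version of the OP's, without the \s* and the backslashed spaces.

```shell
# Sketch only: needs GNU sed, where \< and \> are word-boundary anchors.
# The sample line uses ordinary spaces, which is an assumption about the file.
line='<hr style="border-top: 1px solid rgb(0, 32, 159); margin: 0rem;">'

# \<hr can only match where a word starts, i.e. at the "h", so "<" is never
# part of the match. The greedy .*\> then runs to the last end-of-word
# position on the line, just after "0rem".
result=$(printf '%s\n' "$line" | sed 's/\<hr style.*\>//')

printf '%s\n' "$result"    # prints <;">
```

With the backslashes dropped, 's/<hr style.*>//', the angle brackets are ordinary characters and the whole element is matched.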
> For Gary's actual problem, simply removing the backslashes where they're
> not wanted would be a good start. Actually learning sed could be step 2.

The man/info pages leave a lot to be desired. A table with columns that showed:

code   supported by   effect
\s     -e             match all whitespace except NON-BREAK or whatever
       --posix
       -E
       --posix -E

or whatever might really help. As it is, unless you're looking at a real book,
you get a table like:
'\s'
     Matches whitespace characters (spaces and tabs). Newlines embedded
     in the pattern/hold spaces will also match:
'\S'
     Matches non-whitespace characters.
'\<'
     Matches the beginning of a word.
'\>'
     Matches the end of a word.
but it's next to impossible to keep track of whether you're in a
section that's speaking POSIX, GNU, or some mid-20th century tradition.

> I feel obliged at this point to mention that parsing HTML with regular
> expressions is a fool's errand, and that sed should not be the tool of
> choice here. Nor should grep, nor any other RE-based tool. This goes
> triple when one doesn't even know the correct syntax for their RE.
>
> https://stackoverflow.com/q/1732348

To be fair, I'm not sure whether the OP is really trying to parse
HTML, or just remove some similar strings that they see as redundant.
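If it's the latter, something along these lines might be enough. It is a sketch under several assumptions: the element sits on a line of its own, it is spelled exactly as quoted, and any whitespace around it is ordinary spaces or tabs rather than NO-BREAK SPACEs. The name remove_hr is made up for illustration.

```shell
# Sketch only: remove_hr is a hypothetical helper. It deletes lines that hold
# nothing but the quoted <hr ...> element, using POSIX character classes
# instead of \s and no backslashes in front of the angle brackets.
remove_hr() {
    sed -e '/^[[:space:]]*<hr[[:space:]][[:space:]]*style="border-top: 1px solid rgb(0, 32, 159); margin: 0rem;">[[:space:]]*$/d' "$@"
}
```

As written it prints the result to standard output, for example "remove_hr history.html | diff history.html -", so the change can be checked before letting GNU sed's -i loose on *.html.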
Cheers,
David.