On Fri 21 Oct 2022 at 14:15:01 (-0400), Greg Wooledge wrote:
> On Fri, Oct 21, 2022 at 08:01:00PM +0200, to...@tuxteam.de wrote:
> > On Fri, Oct 21, 2022 at 01:21:44PM -0400, Gary Dale wrote:
> > > I'm hoping someone can tell me what I'm doing wrong. I have a line in a lot
> > > of HTML files that I'd like to remove. The line is:
> > >
> > > <hr style="border-top: 1px solid rgb(0, 32, 159); margin: 0rem;">
> > >
> > > I'm testing the sed command to remove it on just one file. When it works,
> > > I'll run it against *.html. My command is:
> > >
> > > sed -i -s 's/\s*\<hr\ \ style.*\>//g' history.html
> > >
> > > Unfortunately, the replacement doesn't remove the line but rather leaves me
> > > with:
> > >
> > > <;">
> >
> > This looks as if the <> in the regexp were interpreted as left and right
> > word boundaries (but that would only be the case if you'd given the -E
> > (or -r) option).
> >
> > Try explicitly adding the --posix option, perhaps...
>
> Gary is using non-POSIX syntax (specifically the \s), so that's not going
> to help unless he first changes his regular expression to be standard.

The whitespace is tricky. I pasted the email into emacs, and I see
that there are NO-BREAK SPACEs at the start, and one after "hr".
Who knows whether they're really in the OP's files, or just put
there by their MUA.

> I think you might be on to something with the \< and \> here. I can see
> absolutely no reason why Gary put backslashes in front of spaces and
> angle brackets here.

I'm guessing the reason is guessing.

> The backslashes in front of the spaces are probably
> just noise, and can be ignored. The \< and \> on the other hand might
> be interpreted as something special, the same way \s is (because this is
> GNU sed, which loves to do nonstandard things).
>
> unicorn:~$ echo 'abc <foo> xyz' | sed 's/<.*>//'
> abc  xyz
> unicorn:~$ echo 'abc <foo> xyz' | sed 's/\<.*\>//'
>
> unicorn:~$
>
> So... yeah, \< and/or \> clearly have some special meaning to GNU sed.
> Good luck figuring out what that is.

Word boundaries, as tomas said. The .*\> can be seen to have worked,
as matching stopped after the end of the word "rem", leaving the
punctuation behind.

> For Gary's actual problem, simply removing the backslashes where they're
> not wanted would be a good start. Actually learning sed could be step 2.

The man/info pages leave a lot to be desired. A table with columns
that showed:

code   supported by   effect
\s     -e             match all whitespace except NON-BREAK
                      or whatever
       --posix
       -E
       --posix -E
       or whatever

might really help. As it is, unless you're looking at a real book,
you get a table like:

'\s'
     Matches whitespace characters (spaces and tabs). Newlines embedded
     in the pattern/hold spaces will also match:

'\S'
     Matches non-whitespace characters.

'\<'
     Matches the beginning of a word.

'\>'
     Matches the end of a word.

but it's next to impossible to keep track of whether you're in a
section that's speaking POSIX, GNU, or some mid-20th century tradition.

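For what it's worth, the word-boundary reading can be checked with a short sketch. It assumes GNU sed (the \< and \> operators are GNU extensions) and assumes the spaces in the sample line are plain ASCII spaces rather than NO-BREAK SPACEs, which may not be true of the OP's files. The variable names are made up for illustration.

```shell
# Sample line, assuming ordinary ASCII spaces (not NO-BREAK SPACE).
line='<hr style="border-top: 1px solid rgb(0, 32, 159); margin: 0rem;">'

# GNU sed reads \< and \> as start-of-word and end-of-word, so the match
# runs from "hr" to the end of the last word on the line, "rem".
broken=$(printf '%s\n' "$line" | sed 's/\<hr style.*\>//')

# With plain angle brackets and POSIX [[:space:]], the whole tag matches.
fixed=$(printf '%s\n' "$line" | sed 's/[[:space:]]*<hr style.*>//')

printf 'with backslashes:    [%s]\n' "$broken"
printf 'without backslashes: [%s]\n' "$fixed"
```

With GNU sed the first result should be the same `<;">` that the OP reported, and the second should be empty. To drop the line altogether rather than leave an empty one behind, a `/pattern/d` command would be the usual idiom.
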
> I feel obliged at this point to mention that parsing HTML with regular
> expressions is a fool's errand, and that sed should not be the tool of
> choice here. Nor should grep, nor any other RE-based tool. This goes
> triple when one doesn't even know the correct syntax for their RE.
>
> https://stackoverflow.com/q/1732348

To be fair, I'm not sure whether the OP is really trying to parse HTML,
or just remove some similar strings that they see as redundant.

Cheers,
David.