remove an HTML tag and all its children from commandline

Zhang Weiwu Sat, 30 Jan 2010 19:28:31 -0800

Hello. I believe this is a common case that must have been discussed
before on other forums, such as the awk, sed, or regular expression
groups. However, I could not find those discussions with Google. It
would help me a lot if you could simply point me to a reference for a
solution.


I want to remove all advertisements from my 100 HTML files. They are
marked up consistently with a class, like the following:

<div class="advertisement">
...
</div>

However, I cannot simply do this:
s/<div class="advertisement">.*</div>//

That is too greedy: it matches up to the last "</div>", which is almost
always well after the end of the advertisement.

If I make it non-greedy, it also fails, because it stops at the first
</div> inside the advertisement.

Consider this case, where both greedy and non-greedy matching fail:

<div class="page-content">
  <div class="advertisement">
    <div>Our product is the best</div>
    <div>Contact us now!</div>
  </div>
</div>

Greedy output:

    <div class="page-content">

Non-greedy output:

    <div class="page-content">
        <div>Contact us now!</div>
      </div>
    </div>


Expected output:

    <div class="page-content">
    </div>
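
The two failures can be reproduced with a short Python sketch. This is
only an illustration: it assumes Python's re module rather than sed, and
it uses re.DOTALL so that "." can cross line breaks. The names PAGE,
greedy and non_greedy are made up for the example.

```python
import re

# The sample page from above.
PAGE = """<div class="page-content">
  <div class="advertisement">
    <div>Our product is the best</div>
    <div>Contact us now!</div>
  </div>
</div>
"""

AD_OPEN = r'<div class="advertisement">'

# Greedy: ".*" runs to the LAST </div> in the input, so the closing tag
# of page-content is swallowed as well.
greedy = re.sub(AD_OPEN + r'.*</div>', '', PAGE, flags=re.DOTALL)

# Non-greedy: ".*?" stops at the FIRST </div>, which closes an inner
# div, so the rest of the advertisement is left behind.
non_greedy = re.sub(AD_OPEN + r'.*?</div>', '', PAGE, flags=re.DOTALL)

print(greedy)
print(non_greedy)
```

Neither pattern knows which </div> belongs to the advertisement's own
opening tag, which is why both results differ from the expected output.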

The only way to get this right seems to be to give the replace / remove
expression the ability to "count" the <div and </div> tags it
encounters. I could program such a thing in C, thanks to my college
education, but that sounds like overkill for such a common task. What
would you do in this case?
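
To make the "counting" idea concrete, here is a minimal sketch in
Python. It is not a full HTML parser and rests on several assumptions:
the advertisement opens with exactly <div class="advertisement">, no ">"
appears inside attribute values, and no div tags are hidden in comments
or scripts. The function name strip_ads and the sample PAGE are
hypothetical, chosen only for this example.

```python
import re

# Any opening or closing div tag; group 1 is "/" for a closing tag.
DIV_TAG = re.compile(r'<(/?)div\b[^>]*>', re.IGNORECASE)
# The opening tag of an advertisement block (assumed exact form).
AD_OPEN = re.compile(r'<div\s+class="advertisement"\s*>', re.IGNORECASE)


def strip_ads(html):
    """Remove every advertisement div together with all its children."""
    out = []
    pos = 0
    while True:
        start = AD_OPEN.search(html, pos)
        if start is None:
            out.append(html[pos:])      # no more ads: keep the rest
            break
        out.append(html[pos:start.start()])
        depth = 1                       # we are inside the ad div
        for tag in DIV_TAG.finditer(html, start.end()):
            depth += -1 if tag.group(1) else 1
            if depth == 0:              # found the matching </div>
                pos = tag.end()
                break
        else:
            # Unbalanced markup: leave the remainder untouched.
            out.append(html[start.start():])
            break
    return ''.join(out)


PAGE = """<div class="page-content">
  <div class="advertisement">
    <div>Our product is the best</div>
    <div>Contact us now!</div>
  </div>
</div>
"""

print(strip_ads(PAGE))
```

On the sample page this leaves only the outer page-content div, plus the
whitespace that surrounded the advertisement. For messier real-world
markup, a proper HTML parser would be a safer choice than counting tags
with regular expressions.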


