Re: remove an HTML tag and all its children from commandline

2010-01-31 Thread Zhang Weiwu
T o n g wrote:
> For not-so-simple tasks, you need not-so-simple tools. Depending on how
> much time you'd like to investigate into such not-so-simple tools, take a
> look at lib?, sgrep or the xpath language.

Sure. libwww and sgrep are tools, while xpath is a language. I believe I
should try xpath because I might use it in other places too, but
what tool should I use for xpath? Is there a handy command-line tool for
it? The thing I worry about with xpath is this: if it normalizes or
corrects HTML errors, or aligns the output differently, after I have done
the removal, it would be a big problem for me. I am a link in a
corporate workflow chain where others rely on poorly made tools and
incorrect, messy HTML to do their daily work, and I must not break
them by improving the HTML, unless I want to give up my current
peaceful and lazy life, which saves time for more valuable, sane projects.

After glancing at the manual, though, I am pretty sure sgrep can solve
my problem.


-- 
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org 
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Re: remove an HTML tag and all its children from commandline

2010-01-31 Thread Zhang Weiwu
Steve Kemp wrote:

>   You might enjoy my html-tool command which would do the
>  job for you via:

Thank you very much for mentioning this tool. At first glance it seems
wonderful, as if it were designed to solve problems
like mine. However, after trying it, what I worried about most happened:
> The
> thing I worry about with xpath is this: if it normalizes or corrects HTML
> errors, or aligns the output differently, after I have done the
> removal, it would be a big problem for me. I am a link in a
> corporate workflow chain where others rely on poorly made tools and
> incorrect, messy HTML to do their daily work, and I must not break
> them by improving the HTML, unless I want to give up my current
> peaceful and lazy life, which saves time for more valuable, sane projects.
Unfortunately it does. The output HTML no longer works with the stupid
drag-and-drop HTML editor for idiots that my web design guy is using. I am
in the position of delivering on a signed contract, not of evaluating
whether a contract can be done, so I cannot take html-tool as an
option. But I will keep it in mind to use when feasible!

As time is tight, I guess I will just use the crudest solution: adding
the following to all HTML pages:

<style type="text/css">
.advertisement {
display: none;
}
</style>

It is a silly solution that punishes the web visitor for the web
designer's fault. But on the other hand, I think the web designer who made
the junk HTML really should not enjoy too much help from me. Maybe I will
just let it go this way.
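
The stopgap described above can be applied to many files mechanically. What
follows is only a sketch in Python, not something from this thread: the
function names are made up for illustration, and it assumes every page
contains a literal </head> tag (in any letter case) to insert before. It
works on raw bytes so that nothing else in the file is re-encoded or
reformatted.

```python
from pathlib import Path

# The CSS rule quoted above, kept as bytes so no re-encoding takes place.
STYLE_BLOCK = (
    b'<style type="text/css">\n'
    b".advertisement {\n"
    b"display: none;\n"
    b"}\n"
    b"</style>\n"
)


def add_hide_rule(html: bytes) -> bytes:
    """Insert STYLE_BLOCK just before the first </head>, leaving all other bytes alone.

    Returns the input unchanged if the block is already present or if no
    </head> is found (an assumption of this sketch: pages have one).
    """
    if STYLE_BLOCK in html:
        return html
    # bytes.lower() only changes ASCII letters, so offsets stay valid.
    index = html.lower().find(b"</head>")
    if index == -1:
        return html
    return html[:index] + STYLE_BLOCK + html[index:]


def patch_directory(directory: str) -> None:
    """Apply add_hide_rule to every *.html file in one directory (hypothetical helper)."""
    for path in sorted(Path(directory).glob("*.html")):
        original = path.read_bytes()
        patched = add_hide_rule(original)
        if patched != original:
            path.write_bytes(patched)
```

Calling patch_directory(".") would rewrite the pages in the current
directory in place, so it is worth trying on a copy first.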





Re: remove an HTML tag and all its children from commandline

2010-01-31 Thread Steve Kemp
On Sun Jan 31, 2010 at 10:54:46 +0800, Zhang Weiwu wrote:

> I want to remove all advertisements in my 100 html files. They are
> pretty neatly classed, like the following:
>
> <div class="advertisement">
> ...
> </div>

  You might enjoy my html-tool command which would do the
 job for you via:

html-tool --cut-class=advertisement --file input.html

  You can get it via:

wget http://mybin.repository.steve.org.uk/raw-file/tip/html-tool

  Or via the repository at:

http://mybin.repository.steve.org.uk/

  See here for some brief discussion:

http://blog.steve.org.uk/oh__this_should_be_stunning_.html

  Internally it uses the XPath perl module HTML::TreeBuilder::XPath,
 but the details probably don't matter.

Steve
--





Re: remove an HTML tag and all its children from commandline

2010-01-31 Thread Zhang Weiwu
Zhang Weiwu wrote:
> Sure. libwww and sgrep are tools, while xpath is a language. I believe I
> should try xpath because I might use it in other places too, but
> what tool to use for xpath?
Now I think I can answer my own question, partly at least. There is a
good tool for xpath, which is itself named xpath. In Debian it is in this
package:
$ apt-file search /usr/bin/xpath
libxml-xpath-perl: /usr/bin/xpath

An example of using the tool to print the advertisement:

$ tidy -q -asxml -utf8 page_07_zh.html | xpath -e '//div[@class="advertisement"]'
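
For comparison, the same kind of query can be sketched with Python's
standard library, which supports a limited subset of XPath. This is an
illustration only: the sample document is made up, and it assumes
well-formed XML without a namespace. Real tidy -asxml output may declare the
XHTML namespace, in which case the element names in the path would need to
be namespace-qualified. Like the pipeline above, this only finds and prints
the nodes; writing the tree back out would re-serialize the markup, which is
exactly the normalization worry raised earlier in the thread.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample standing in for tidied output.
XHTML = """<html><body>
<div class="page-content">
<div class="advertisement"><div>Our product is the best</div></div>
<p>Real content</p>
</div>
</body></html>"""

root = ET.fromstring(XHTML)

# ElementTree's XPath subset supports attribute predicates like this one.
ads = root.findall(".//div[@class='advertisement']")

for ad in ads:
    # Serialize each matching element, including its children.
    print(ET.tostring(ad, encoding="unicode"))
```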





Re: remove an HTML tag and all its children from commandline

2010-01-31 Thread T o n g
On Sun, 31 Jan 2010 20:05:46 +0800, Zhang Weiwu wrote:

> $ tidy -q -asxml -utf8 page_07_zh.html | xpath -e
> '//div[@class="advertisement"]'

Exactly. Glad that you found both tidy and libxml-xpath-perl, and solved
the problem yourself.

-- 
Tong (remove underscore(s) to reply)
  http://xpt.sourceforge.net/techdocs/
  http://xpt.sourceforge.net/tools/





remove an HTML tag and all its children from commandline

2010-01-30 Thread Zhang Weiwu
Hello. I believe this is a common case and must have been discussed
before on various other forums, such as the awk/sed/regular expression
groups. However, I could not find those discussions with Google. You would
help me a lot if you simply pointed me to a reference to a solution.

I want to remove all advertisements in my 100 html files. They are
pretty neatly classed, like the following:

<div class="advertisement">
...
</div>

However, I cannot simply do this:
s/<div class="advertisement">.*<\/div>//

Because it is too greedy: it matches up to the last </div>, which
is almost always after the advertisement.

If I make it non-greedy, it also fails, because it stops at the
first </div> inside the advertisement.

Consider this case, in which both greedy and non-greedy matching fail:

<div class="page-content">
  <div class="advertisement">
<div>Our product is the best</div>
<div>Contact us now!</div>
  </div>
</div>

Greedy output:

<div class="page-content">

Non-greedy output:

<div class="page-content">
<div>Contact us now!</div>
  </div>
</div>


Expected output:

<div class="page-content">
</div>
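
The two failures can be reproduced with a short sketch. It uses Python's re
module rather than sed (a substitution made here for illustration, since
Python supports both greedy and non-greedy quantifiers), and the sample page
is the one shown above.

```python
import re

PAGE = """<div class="page-content">
  <div class="advertisement">
<div>Our product is the best</div>
<div>Contact us now!</div>
  </div>
</div>
"""

# re.DOTALL lets "." match newlines, so the pattern can span several lines.
# Greedy: ".*" runs to the LAST </div>, swallowing the outer closing tag.
greedy = re.sub(r'<div class="advertisement">.*</div>', "", PAGE, flags=re.DOTALL)

# Non-greedy: ".*?" stops at the FIRST </div>, inside the advertisement.
lazy = re.sub(r'<div class="advertisement">.*?</div>', "", PAGE, flags=re.DOTALL)

print(greedy)
print(lazy)
```

Neither result is balanced: the greedy one loses the closing tag of
page-content, and the non-greedy one leaves part of the advertisement plus a
stray closing tag behind.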

The only way to get it right seems to be to give the replace/remove
expression the ability to count the <div> and </div> tags
it encounters. I could program such a thing in C thanks to my college
education, but it sounds like overkill for such a common task. What would
you do in this case?
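
The counting idea does not need C. Below is a minimal sketch in Python,
under stated assumptions: the function name cut_divs is invented here, the
opening tag is written exactly as <div class="advertisement">, and the pages
contain no <div> look-alikes inside comments or scripts. It is a tag
counter, not an HTML parser, and it copies every byte outside the removed
block untouched, so nothing else is normalized.

```python
import re

# Any opening or closing div tag; group 1 is "/" for a closing tag.
DIV_TAG = re.compile(r"<(/?)div\b[^>]*>", re.IGNORECASE)
# Assumption: the advertisement opener looks exactly like this.
AD_OPEN = re.compile(r'<div\s+class="advertisement"\s*>', re.IGNORECASE)


def cut_divs(html: str) -> str:
    """Remove each advertisement div together with all of its children."""
    out = []
    pos = 0
    while True:
        start = AD_OPEN.search(html, pos)
        if start is None:
            out.append(html[pos:])  # no more advertisements: keep the rest
            return "".join(out)
        out.append(html[pos:start.start()])  # keep text before the opener
        depth = 1
        scan = start.end()
        while depth > 0:
            tag = DIV_TAG.search(html, scan)
            if tag is None:
                # Unbalanced markup: keep the remainder rather than guess.
                out.append(html[start.start():])
                return "".join(out)
            depth += -1 if tag.group(1) else 1  # </div> closes, <div> opens
            scan = tag.end()
        pos = scan  # resume right after the matching </div>


PAGE = """<div class="page-content">
  <div class="advertisement">
<div>Our product is the best</div>
<div>Contact us now!</div>
  </div>
</div>
"""

print(cut_divs(PAGE))
```

On the sample page this leaves the page-content div with only the
whitespace that surrounded the advertisement, which matches the expected
output apart from one blank, indented line.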





Re: remove an HTML tag and all its children from commandline

2010-01-30 Thread T o n g
On Sun, 31 Jan 2010 10:54:46 +0800, Zhang Weiwu wrote:

> I want to remove all advertisements in my 100 html files. They are
> pretty neatly classed, like the following:
>
> <div class="advertisement">
> ...
> </div>
>
> However I could not simply do this:
> s/<div class="advertisement">.*<\/div>//
>
> Because it is too greedy

For not-so-simple tasks, you need not-so-simple tools. Depending on how
much time you'd like to invest in such not-so-simple tools, take a
look at lib?, sgrep or the xpath language.

HTH

-- 
Tong (remove underscore(s) to reply)
  http://xpt.sourceforge.net/techdocs/
  http://xpt.sourceforge.net/tools/





Re: remove an HTML tag and all its children from commandline

2010-01-30 Thread Celejar
On Sun, 31 Jan 2010 10:54:46 +0800
Zhang Weiwu <zhangwe...@realss.com> wrote:

...

> I want to remove all advertisements in my 100 html files. They are
> pretty neatly classed, like the following:
>
> <div class="advertisement">
> ...
> </div>
>
> However I could not simply do this:
> s/<div class="advertisement">.*<\/div>//
>
> Because it is too greedy, that matches the </div> till the last, which
> is almost always after the advertisement.
>
> If I set it to not to be greedy, it also fail because it stops at the
> first </div> inside the advertisement.

...

> The only way to make it right seems to be able to give the replacement /
> remove expression the ability to count the number of <div> and </div>
> it encounters. I could program such thing in C thanks to my college
> education, but it sounds overkill for such a common task. What would you
> do in this case?

Among programmers of any experience, it is generally regarded as A Bad
Idea(tm) to attempt to parse HTML with regular expressions. How bad of an
idea? It apparently drove one Stack Overflow user to the brink of
madness:

You can't parse [X]HTML with regex. Because HTML can't be parsed by
regex. Regex is not a tool that can be used to correctly parse HTML. As
I have answered in HTML-and-regex questions here so many times before,
the use of regex will not allow you to consume HTML.

Regular expressions are a tool that is insufficiently sophisticated to
understand the constructs employed by HTML. HTML is not a regular
language and hence cannot be parsed by regular expressions. Regex
queries are not equipped to break down HTML into its meaningful parts.
so many times but it is not getting to me. Even enhanced irregular
regular expressions as used by Perl are not up to the task of parsing
HTML. You will never make me crack. HTML is a language of sufficient
complexity that it cannot be parsed by regular expressions.

Even Jon Skeet cannot parse HTML using regular expressions. Every time
you attempt to parse HTML with regular expressions, the unholy child
weeps the blood of virgins, and Russian hackers pwn your webapp.
Parsing HTML with regex summons tainted souls into the realm of the
living. HTML and regex go together like love, marriage, and ritual
infanticide. The center cannot hold it is too late. The force of
regex and HTML together in the same conceptual space will destroy your
mind like so much watery putty. If you parse HTML with regex you are
giving in to Them and their blasphemous ways which doom us all to
inhuman toil for the One whose Name cannot be expressed in the Basic
Multilingual Plane, he comes.

That's right, if you attempt to parse HTML with regular expressions,
you're succumbing to the temptations of the dark god Cthulhu's … er …
code.

http://www.codinghorror.com/blog/archives/001311.html

Read on for more detail, and the Right Way to do this.

Celejar
-- 
foffl.sourceforge.net - Feeds OFFLine, an offline RSS/Atom aggregator
mailmin.sourceforge.net - remote access via secure (OpenPGP) email
ssuds.sourceforge.net - A Simple Sudoku Solver and Generator

