Re: remove an HTML tag and all its children from commandline
T o n g wrote:

> For not-so-simple tasks, you need not-so-simple tools. Depending on how much time you'd like to invest in such not-so-simple tools, take a look at lib?, sgrep or the xpath language.

Sure. libwww and sgrep are tools, while xpath is a language. I believe I should try xpath because I might use it in other places too, but which tool should I use for xpath? Is there a handy command-line tool for it?

One thing worries me a bit about xpath: if the tool normalizes or corrects HTML errors, or lays the markup out differently in its output after I have done the removal, that would be a big problem for me. I am one link in a corporate workflow chain in which others rely on poorly made tools and on incorrect, messy HTML to do their daily work, and I must not break them by improving the HTML, unless I want to give up my current peaceful, lazy life, which saves my time for more valuable, sane projects.

After glancing at the manual, though, I am pretty sure sgrep can solve my problem.

--
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Re: remove an HTML tag and all its children from commandline
Steve Kemp wrote:

> You might enjoy my html-tool command which would do the job for you via:

Thank you very much for mentioning this tool. At first glance it seems almost too wonderful, as if it had been designed to solve problems like mine. However, when I tried it, what I feared most happened:

> One thing worries me a bit about xpath: if the tool normalizes or corrects HTML errors, or lays the markup out differently in its output after I have done the removal, that would be a big problem for me. I am one link in a corporate workflow chain in which others rely on poorly made tools and on incorrect, messy HTML to do their daily work, and I must not break them by improving the HTML, unless I want to give up my current peaceful, lazy life, which saves my time for more valuable, sane projects.

Unfortunately it does. The output HTML no longer works with the stupid drag-and-drop HTML editor for idiots that my web design guy is using. I am in the position of delivering on a signed contract, not of evaluating whether a contract can be done, so I cannot take html-tool as an option. But I will certainly keep it in mind for when it is feasible!

As time is tight, I guess I will just use the crudest solution: adding the following to all HTML pages:

    <style type=text/css> .advertisement { display: none; } </style>

It is a silly solution that punishes the web visitor for the web designer's fault. But on the other hand, I think the web designer who made the junk HTML really should not enjoy too much help from me. Maybe I will just let it go this way.
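The stylesheet workaround described above could be applied to many pages with a small script. The following is only a sketch in Python, not something posted in the thread: the names `HIDE_RULE`, `HEAD_CLOSE` and `add_hide_rule` are made up for illustration, and it assumes every page contains a closing `</head>` tag. It inserts the rule just before that tag and copies every other character through unchanged, so the rest of the markup is not normalized.

```python
import re

# The rule from the message above, wrapped in a style element.
HIDE_RULE = '<style type="text/css">.advertisement { display: none; }</style>'

# Assumption: each page has a closing head tag, in any letter case.
HEAD_CLOSE = re.compile(r"</head\s*>", re.IGNORECASE)


def add_hide_rule(html: str) -> str:
    """Insert HIDE_RULE before the first </head>; leave everything else as-is."""
    if HIDE_RULE in html:
        return html  # already patched, do nothing
    m = HEAD_CLOSE.search(html)
    if m is None:
        return html  # no </head> found: return the page untouched
    return html[:m.start()] + HIDE_RULE + "\n" + html[m.start():]


page = "<html><head><title>t</title></head><body></body></html>"
patched = add_hide_rule(page)
print(patched)
```

Running the function a second time on its own output returns the text unchanged, so the script could be re-run over the same files safely.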
Re: remove an HTML tag and all its children from commandline
On Sun Jan 31, 2010 at 10:54:46 +0800, Zhang Weiwu wrote:

> I want to remove all advertisements in my 100 html files. They are pretty neatly classed, like the following:
>
>     <div class=advertisement> ... </div>

You might enjoy my html-tool command, which would do the job for you via:

    html-tool --cut-class=advertisement --file input.html

You can get it via:

    wget http://mybin.repository.steve.org.uk/raw-file/tip/html-tool

Or via the repository at:

    http://mybin.repository.steve.org.uk/

See here for some brief discussion:

    http://blog.steve.org.uk/oh__this_should_be_stunning_.html

Internally it uses the XPath Perl module HTML::TreeBuilder::XPath, but the details probably don't matter.

Steve
--
Re: remove an HTML tag and all its children from commandline
Zhang Weiwu wrote:

> Sure. libwww and sgrep are tools, while xpath is a language. I believe I should try xpath because I might use it in other places too, but which tool should I use for xpath?

Now I think I can answer my own question, at least partly. There is a good tool for xpath, and it is itself named xpath. In Debian it is in this package:

    $ apt-file search /usr/bin/xpath
    libxml-xpath-perl: /usr/bin/xpath

An example of using the tool to print the advertisement:

    $ tidy -q -asxml -utf8 page_07_zh.html | xpath -e '//d...@class=advertisement]'
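For comparison, the same select-then-remove idea can be sketched with Python's standard-library ElementTree, which supports a limited XPath subset. This is an illustration under assumptions, not the command from the message: the expression `.//div[@class='advertisement']` is a guess at the intent, the sample markup is invented, and it must be well-formed XML with no namespace (real `tidy -asxml` output carries the XHTML namespace, which this sketch does not handle).

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample; real pages would first need tidying.
xhtml = (
    "<div class='page-content'>"
    "<div class='advertisement'><div>Our product is the best</div></div>"
    "<p>Real content</p>"
    "</div>"
)

root = ET.fromstring(xhtml)

# Select: print every advertisement block found anywhere below the root.
ads = root.findall(".//div[@class='advertisement']")
for ad in ads:
    print(ET.tostring(ad, encoding="unicode"))

# Remove: ElementTree elements have no parent pointer, so visit every
# element and drop any direct child that is an advertisement div.
# Note that remove() also discards text that follows the removed element.
for parent in list(root.iter()):
    for ad in parent.findall("div[@class='advertisement']"):
        parent.remove(ad)

result = ET.tostring(root, encoding="unicode")
print(result)
```

The output is re-serialized by the library (attribute quotes change from single to double, for example), which is exactly the kind of normalization that matters if downstream tools depend on the original bytes.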
Re: remove an HTML tag and all its children from commandline
On Sun, 31 Jan 2010 20:05:46 +0800, Zhang Weiwu wrote:

>     $ tidy -q -asxml -utf8 page_07_zh.html | xpath -e '//d...@class=advertisement]'

Exactly. Glad that you found both tidy and libxml-xpath-perl, and solved the problem yourself.

--
Tong (remove underscore(s) to reply)
http://xpt.sourceforge.net/techdocs/
http://xpt.sourceforge.net/tools/
remove an HTML tag and all its children from commandline
Hello.

I believe this is a common case and must have been discussed before on various other forums, such as the awk/sed/regular-expression groups. However, I could not find those discussions with Google. You would be helping me a lot if you simply pointed me to a reference for a solution.

I want to remove all advertisements in my 100 html files. They are pretty neatly classed, like the following:

    <div class=advertisement> ... </div>

However, I cannot simply do this:

    s/<div class=advertisement>.*<\/div>//

because it is too greedy: it matches up to the last </div>, which is almost always after the advertisement. If I make it non-greedy, it also fails, because it stops at the first </div> inside the advertisement. Consider this case, where both greedy and non-greedy fail:

    <div class=page-content>
      <div class=advertisement>
        <div>Our product is the best</div>
        <div>Contact us now!</div>
      </div>
    </div>

Greedy output:

    <div class=page-content>

Non-greedy output:

    <div class=page-content>
        <div>Contact us now!</div>
      </div>
    </div>

Expected output:

    <div class=page-content>
    </div>

The only way to get it right seems to be to give the replace/remove expression the ability to count the number of <div> and </div> tags it encounters. I could program such a thing in C thanks to my college education, but that sounds like overkill for such a common task.

What would you do in this case?
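The counting idea from the question can be sketched in a few lines of Python rather than C. This is a rough illustration, not a real HTML parser: `strip_ads`, `DIV_TAG` and `AD_CLASS` are made-up names, it assumes div tags are balanced inside each advertisement, and it ignores comments, scripts and attribute values that contain `>`. It only tracks div nesting depth and copies every character outside the advertisement through untouched, so the rest of the markup is not normalized.

```python
import re

# Any opening or closing div tag; group 1 is "/" for a closing tag.
DIV_TAG = re.compile(r"<(/?)div\b[^>]*>", re.IGNORECASE)
# Assumption: the ad is marked by class=advertisement (quoted or not).
AD_CLASS = re.compile(r"""\bclass=["']?advertisement["']?[\s>]""", re.IGNORECASE)


def strip_ads(html: str) -> str:
    out = []
    pos = 0    # start of the text not yet copied to the output
    depth = 0  # div nesting depth inside an advertisement (0 = outside)
    for m in DIV_TAG.finditer(html):
        closing = m.group(1) == "/"
        if depth == 0:
            if not closing and AD_CLASS.search(m.group(0)):
                out.append(html[pos:m.start()])  # keep text before the ad
                pos = m.start()                  # ad stays if never closed
                depth = 1
        elif closing:
            depth -= 1
            if depth == 0:
                pos = m.end()  # resume copying after the ad's own </div>
        else:
            depth += 1
    out.append(html[pos:])
    return "".join(out)


sample = (
    "<div class=page-content>\n"
    "<div class=advertisement>\n"
    "<div>Our product is the best</div>\n"
    "<div>Contact us now!</div>\n"
    "</div>\n"
    "</div>\n"
)
cleaned = strip_ads(sample)
print(cleaned)
```

If an advertisement div is never closed, the sketch leaves the input from that point on unchanged instead of deleting the rest of the page.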
Re: remove an HTML tag and all its children from commandline
On Sun, 31 Jan 2010 10:54:46 +0800, Zhang Weiwu wrote:

> I want to remove all advertisements in my 100 html files. They are pretty neatly classed, like the following:
>
>     <div class=advertisement> ... </div>
>
> However, I cannot simply do this:
>
>     s/<div class=advertisement>.*<\/div>//
>
> because it is too greedy

For not-so-simple tasks, you need not-so-simple tools. Depending on how much time you'd like to invest in such not-so-simple tools, take a look at lib?, sgrep or the xpath language.

HTH

--
Tong (remove underscore(s) to reply)
http://xpt.sourceforge.net/techdocs/
http://xpt.sourceforge.net/tools/
Re: remove an HTML tag and all its children from commandline
On Sun, 31 Jan 2010 10:54:46 +0800 Zhang Weiwu <zhangwe...@realss.com> wrote:

> ...
>
> I want to remove all advertisements in my 100 html files. They are pretty neatly classed, like the following:
>
>     <div class=advertisement> ... </div>
>
> However, I cannot simply do this:
>
>     s/<div class=advertisement>.*<\/div>//
>
> because it is too greedy: it matches up to the last </div>, which is almost always after the advertisement. If I make it non-greedy, it also fails, because it stops at the first </div> inside the advertisement.
>
> ...
>
> The only way to get it right seems to be to give the replace/remove expression the ability to count the number of <div> and </div> tags it encounters. I could program such a thing in C thanks to my college education, but that sounds like overkill for such a common task.
>
> What would you do in this case?

Among programmers of any experience, it is generally regarded as A Bad Idea(tm) to attempt to parse HTML with regular expressions. How bad of an idea? It apparently drove one Stack Overflow user to the brink of madness:

    You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions.

    Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The center cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes.

That's right, if you attempt to parse HTML with regular expressions, you're succumbing to the temptations of the dark god Cthulhu's ... er ... code.

http://www.codinghorror.com/blog/archives/001311.html

Read on for more detail, and the Right Way to do this.

Celejar
--
foffl.sourceforge.net - Feeds OFFLine, an offline RSS/Atom aggregator
mailmin.sourceforge.net - remote access via secure (OpenPGP) email
ssuds.sourceforge.net - A Simple Sudoku Solver and Generator
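To make the failure concrete, the nested example from the original question can be run through a greedy and a non-greedy pattern. This is a small Python illustration, not part of the thread: the variable names are made up, and the sample is written on one line with unquoted attributes for brevity.

```python
import re

# The nested example from the original post, flattened onto one line.
page = (
    "<div class=page-content>"
    "<div class=advertisement>"
    "<div>Our product is the best</div>"
    "<div>Contact us now!</div>"
    "</div>"
    "</div>"
)

# Greedy: .* runs to the LAST </div>, so the outer closing tag is lost too.
greedy = re.sub(r"<div class=advertisement>.*</div>", "", page, flags=re.DOTALL)

# Non-greedy: .*? stops at the FIRST </div>, inside the advertisement.
lazy = re.sub(r"<div class=advertisement>.*?</div>", "", page, flags=re.DOTALL)

print(greedy)  # <div class=page-content>
print(lazy)    # <div class=page-content><div>Contact us now!</div></div></div>
```

Neither pattern can know which `</div>` closes the advertisement, because that depends on how many divs were opened in between; keeping that count is what a parser (or an explicit depth counter) does.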