Re: parsing html
On Thu, 8 Aug 2013 22:42:01 +0530 Unknown User knowsuperunkn...@gmail.com wrote:
> What would be the best module available for parsing html in your opinion? My intention is to parse html that contains a table of 5 columns and any number of rows

For parsing HTML tables, you want HTML::TableExtract, IMO.

https://metacpan.org/module/HTML::TableExtract

It makes life easy.

-- To unsubscribe, e-mail: beginners-unsubscr...@perl.org For additional commands, e-mail: beginners-h...@perl.org http://learn.perl.org/
Re: parsing html
Have a look at HTML::Parser.

On Aug 8, 2013 10:50 PM, Unknown User knowsuperunkn...@gmail.com wrote:
> What would be the best module available for parsing html in your opinion? My intention is to parse html that contains a table of 5 columns and any number of rows, and have a hash ref like
>
>     $html->{1}->{col1}=data11, $html->{1}->{col2}=data12 ...
>     $html->{2}->{col1}=data21, $html->{2}->{col2}=data22 ...
>     etc
>
> Would there be an existing module that can do this without too much effort on my part? Thanks,
Re: parsing html
On 8 Aug 2013 18:19, Unknown User knowsuperunkn...@gmail.com wrote:
> What would be the best module available for parsing html in your opinion?

I would also say look at HTML::TreeBuilder

> My intention is to parse html that contains a table of 5 columns and any number of rows, and have a hash ref like
>
>     $html->{1}->{col1}=data11, $html->{1}->{col2}=data12 ...
>     $html->{2}->{col1}=data21, $html->{2}->{col2}=data22 ...
>     etc
>
> Would there be an existing module that can do this without too much effort on my part? Thanks,
Re: parsing html
On Thu, Aug 8, 2013 at 10:18 AM, David Precious dav...@preshweb.co.uk wrote:
> On Thu, 8 Aug 2013 22:42:01 +0530 Unknown User knowsuperunkn...@gmail.com wrote:
>> What would be the best module available for parsing html in your opinion? My intention is to parse html that contains a table of 5 columns and any number of rows
>
> For parsing HTML tables, you want HTML::TableExtract, IMO.
> https://metacpan.org/module/HTML::TableExtract
> It makes life easy.

I'm with David since the stated objective was to extract table info. As powerful as the other parsing modules are, you certainly would be grinding out more code.

-- Charles DeRykus
Re: parsing html
On 08/08/2013 18:18, David Precious wrote:
> On Thu, 8 Aug 2013 22:42:01 +0530 Unknown User knowsuperunkn...@gmail.com wrote:
>> What would be the best module available for parsing html in your opinion? My intention is to parse html that contains a table of 5 columns and any number of rows
>
> For parsing HTML tables, you want HTML::TableExtract, IMO.

Yes. It would help if you explained more about what you want to do, and showed some HTML rather than just the desired data structure, but HTML::TableExtract seems to be the right module for this.

There are better HTML parsers for general purposes. XML::LibXML will handle HTML, and HTML::TreeBuilder is nice.

Rob
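As a rough illustration of the HTML::TableExtract suggestion, here is a minimal sketch that builds the hash ref asked for at the start of the thread. The sample table, the header names col1..col5 and the variable names are assumptions made for the example; the original poster never showed any HTML, so a real page would need its own header strings.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical input: one header row and two data rows.
my $html_text = <<'HTML';
<table>
  <tr><th>col1</th><th>col2</th><th>col3</th><th>col4</th><th>col5</th></tr>
  <tr><td>data11</td><td>data12</td><td>data13</td><td>data14</td><td>data15</td></tr>
  <tr><td>data21</td><td>data22</td><td>data23</td><td>data24</td><td>data25</td></tr>
</table>
HTML

my @cols = qw(col1 col2 col3 col4 col5);

# Selecting a table by its headers yields only the rows below the
# header row, with cells in the order the headers were listed.
my $te = HTML::TableExtract->new( headers => \@cols );
$te->parse($html_text);

my $html = {};
my $n    = 0;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        $n++;
        @{ $html->{$n} }{@cols} = @$row;    # hash slice: column name => cell
    }
}

print "$html->{2}{col1}\n";
```

Row numbers start at 1 here only to match the `$html->{1}->{col1}` shape in the question; an array of hashes would be the more usual choice.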
Re: Parsing HTML file
Hi,

Try the HTML tree modules on CPAN. There are also HTML table modules, but I haven't used them. Trees can take a while to understand, but they seem to be pretty good.

Pat

On Mon, Jun 2, 2008 at 12:10 PM, Jeff Peng [EMAIL PROTECTED] wrote:
> On Mon, Jun 2, 2008 at 5:56 PM, Purohit, Bhargav [EMAIL PROTECTED] wrote:
>> I have a HTML file and which contains many tables in it. Out of one of the table I want to extract Information of only few of the columns. Can anybody help me.
>
> Nobody can help unless you have tried some coding at first. Take a look at this module on CPAN: http://search.cpan.org/~gaas/HTML-Parser-3.56/Parser.pm
>
> -- Jeff Peng - [EMAIL PROTECTED] Professional Squid supports in China http://www.ChinaSquid.com/
Re: Parsing HTML file
On Mon, Jun 2, 2008 at 5:56 PM, Purohit, Bhargav [EMAIL PROTECTED] wrote:
> I have a HTML file and which contains many tables in it. Out of one of the table I want to extract Information of only few of the columns. Can anybody help me.

Nobody can help unless you have tried some coding at first. Take a look at this module on CPAN: http://search.cpan.org/~gaas/HTML-Parser-3.56/Parser.pm

-- Jeff Peng - [EMAIL PROTECTED] Professional Squid supports in China http://www.ChinaSquid.com/
Re: parsing html question
On Feb 5, 2008 10:36 AM, isaac2004 [EMAIL PROTECTED] wrote:
> hello, i am trying to parse an html document for links for output, my idea is to grab the URL from a form and send the URL to another file that does the actual parse process. i am aware that HTML::Parser has a built in for this, but i want to learn regex better.

Let me know if your technique helps you to learn regular expressions. You will also have the chance to learn how complex a data format HTML is, and why parsers are more complex than simple pattern matches.

For other people who want to learn more about regular expressions, I'd recommend the book Mastering Regular Expressions by Jeffrey Friedl. REs are a small language of their own, worthy of study. This book has interesting, useful, and practical information for any programmer who works with regular expressions.

In fact, when the book first came out, everyone was expecting that any book on REs would be a pocket reference. Everyone was stunned at how much of this fat book was news even to the experts. (Several bugs and shortcomings of Perl's RE engine were fixed as a direct result of that first edition.) I'm sure that anyone who frequently uses Perl's patterns, or any other regular expressions, will find this book informative today.

http://regex.info/

Cheers!
--Tom Phoenix
Stonehenge Perl Training
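For comparison with the regex approach, here is a small sketch of link extraction with a parser, using HTML::TokeParser (part of the HTML-Parser distribution). The sample page is made up for the example. It contains an upper-case tag, a single-quoted attribute, a commented-out link and an anchor without href, which are the sort of cases a simple pattern tends to get wrong.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical page content.
my $page = <<'HTML';
<p>See <a href="http://learn.perl.org/">learn.perl.org</a> and
<A HREF='http://regex.info/'>the regex book</A>.
<!-- <a href="http://example.invalid/commented-out">not a link</a> -->
<a name="anchor-only">no href here</a></p>
HTML

my $parser = HTML::TokeParser->new( \$page );

my @links;
# get_tag('a') skips ahead to the next <a ...> start tag and returns
# [ tagname, \%attributes, ... ]; names arrive lower-cased.
while ( my $tag = $parser->get_tag('a') ) {
    my $href = $tag->[1]{href};
    push @links, $href if defined $href;
}

print "$_\n" for @links;
```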
Re: parsing HTML content
On Aug 30, 9:37 pm, [EMAIL PROTECTED] (Daniel Kasak) wrote:
> On Thu, 2007-08-30 at 07:16 -0700, ladder49 wrote:
>> Is there a way to dump the HTML code for a web page? I need to write a script which will collect and summarize content from intranet web pages. By dump, I mean to read it the same way you would read a file and parse its contents. Thanks.
>
> I use LWP::Simple to fetch stuff, and HTML::TreeBuilder to parse it and extract stuff.
>
> -- Daniel Kasak IT Developer NUS Consulting Group Level 5, 77 Pacific Highway North Sydney, NSW, Australia 2060 T: (+61) 2 9922-7676 / F: (+61) 2 9922 7989 email: [EMAIL PROTECTED] website: http://www.nusconsulting.com.au

Daniel, Jeff,

Thanks for your replies. Your pointing me to LWP got me started and now I've got a working script. Thanks again!
Re: parsing HTML content
2007/8/30, ladder49 [EMAIL PROTECTED]:
> Is there a way to dump the HTML code for a web page? I need to write a script which will collect and summarize content from intranet web pages. By dump, I mean to read it the same way you would read a file and parse its contents. Thanks.

You can use LWP to do this, like:

    perl -MLWP::Simple -e '$c=get("http://www.yahoo.cn/");print $c'

See also `perldoc lwpcook`.

-- Jeff Pang - [EMAIL PROTECTED] http://www.readwriteweb.com/
Re: parsing HTML content
On Thu, 2007-08-30 at 07:16 -0700, ladder49 wrote:
> Is there a way to dump the HTML code for a web page? I need to write a script which will collect and summarize content from intranet web pages. By dump, I mean to read it the same way you would read a file and parse its contents. Thanks.

I use LWP::Simple to fetch stuff, and HTML::TreeBuilder to parse it and extract stuff.

-- Daniel Kasak IT Developer NUS Consulting Group Level 5, 77 Pacific Highway North Sydney, NSW, Australia 2060 T: (+61) 2 9922-7676 / F: (+61) 2 9922 7989 email: [EMAIL PROTECTED] website: http://www.nusconsulting.com.au
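The fetch-then-parse combination described in this thread might look roughly like the sketch below. The helper name summarize_html and the choice of "title plus h1/h2 headings" as the summary are assumptions for illustration; the thread does not say what the intranet pages contain. The fetch only runs when a URL is passed on the command line.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::TreeBuilder;

# Hypothetical helper: return the page title and the text of every
# <h1>/<h2> heading found in an HTML string.
sub summarize_html {
    my ($content) = @_;
    my $tree     = HTML::TreeBuilder->new_from_content($content);
    my $title_el = $tree->look_down( _tag => 'title' );
    my %summary  = (
        title    => $title_el ? $title_el->as_trimmed_text : '',
        headings => [
            map { $_->as_trimmed_text }
                $tree->look_down( _tag => qr/^h[12]$/ )
        ],
    );
    $tree->delete;    # release the tree explicitly
    return \%summary;
}

# Fetch only when a URL is given as an argument.
if ( my $url = shift @ARGV ) {
    my $content = get($url);
    die "Could not fetch $url\n" unless defined $content;
    my $summary = summarize_html($content);
    print "$summary->{title}\n";
    print "  $_\n" for @{ $summary->{headings} };
}
```

Keeping the parsing in its own sub means it can be tried on a saved file or a string before pointing it at a live server.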
Re: Parsing HTML (Table)
yitzle wrote:
> I'm using WWW::Mechanize to retrieve a web page. I get to this line:
>
>     my $page = $mech->response()->decoded_content();
>
> The page has a table with values I wish to extract. What module is best suited to getting to that data? I'm hoping for a somewhat simple to use module. WWW::Mechanize is the first object oriented module I used (this project), so I'm still getting used to it.

Hi. I recommend HTML::TableExtract for what you're describing. We can help you more if you give us the URL of the web page you're looking at and what data you need from it.

HTH,
Rob
Re: parsing html data
On Tuesday 22 May 2007 22:38, [EMAIL PROTECTED] wrote:
> Hi,
>
> If the problem is to get a dictionary translation, I think you should not work directly with LWP or WWW::Mechanize. Those modules provide a convenient way to get the answer in HTML, which will force you to hack HTML files. Alternatively, you can use dictionaries that provide an API, like www.dictionary.com (with the Perl module WWW::Dictionary, http://search.cpan.org/~cog/WWW-Dictionary-0.01/lib/WWW/Dictionary.pm) or even Wikipedia (WWW::Wikipedia at http://search.cpan.org/~bricas/WWW-Wikipedia-1.92/lib/WWW/Wikipedia.pm). You can find many more dictionaries on CPAN.
>
> Yaron Kahanovitch
>
> ----- Original Message -----
> From: xavier mas [EMAIL PROTECTED]
> To: beginners@perl.org
> Sent: 22:14:07 (GMT+0200) Africa/Harare, Tuesday 22 May 2007
> Subject: Re: parsing html data
>
> On Tuesday 22 May 2007 22:00, David Moreno Garza wrote:
>> xavier mas wrote:
>>> dear all, I'm trying to query an online dictionary from an html document; the result of the query has to be output in a text field of the same html document. I understand this can be done using (perl) cgi scripts but am not sure which module I need for that.
>>
>> A number of modules around LWP can help you with it. WWW::Mechanize, specifically, can help you with its specific method, content():
>>
>>     $mech->content(format => 'text');
>>
>> -- David Moreno Garza [EMAIL PROTECTED] | http://www.damog.net/ URL:http://pub.tsn.dk/how-to-quote.php
>
> Many thanks, David.
> -- Xavier Mas

Thank you, Yaron. I know this, but the purpose is to write an application that does that itself.

Greetings,
-- Xavier Mas
Re: parsing html data
xavier mas wrote:
> dear all, I'm trying to query an online dictionary from an html document; the result of the query has to be output in a text field of the same html document. I understand this can be done using (perl) cgi scripts but am not sure which module I need for that.

A number of modules around LWP can help you with it. WWW::Mechanize, specifically, can help you with its specific method, content():

    $mech->content(format => 'text');

-- David Moreno Garza [EMAIL PROTECTED] | http://www.damog.net/ URL:http://pub.tsn.dk/how-to-quote.php
Re: parsing html data
On Tuesday 22 May 2007 22:00, David Moreno Garza wrote:
> xavier mas wrote:
>> dear all, I'm trying to query an online dictionary from an html document; the result of the query has to be output in a text field of the same html document. I understand this can be done using (perl) cgi scripts but am not sure which module I need for that.
>
> A number of modules around LWP can help you with it. WWW::Mechanize, specifically, can help you with its specific method, content():
>
>     $mech->content(format => 'text');
>
> -- David Moreno Garza [EMAIL PROTECTED] | http://www.damog.net/ URL:http://pub.tsn.dk/how-to-quote.php

Many thanks, David.
-- Xavier Mas
Re: parsing html data
Hi,

If the problem is to get a dictionary translation, I think you should not work directly with LWP or WWW::Mechanize. Those modules provide a convenient way to get the answer in HTML, which will force you to hack HTML files. Alternatively, you can use dictionaries that provide an API, like www.dictionary.com (with the Perl module WWW::Dictionary, http://search.cpan.org/~cog/WWW-Dictionary-0.01/lib/WWW/Dictionary.pm) or even Wikipedia (WWW::Wikipedia at http://search.cpan.org/~bricas/WWW-Wikipedia-1.92/lib/WWW/Wikipedia.pm). You can find many more dictionaries on CPAN.

Yaron Kahanovitch

----- Original Message -----
From: xavier mas [EMAIL PROTECTED]
To: beginners@perl.org
Sent: 22:14:07 (GMT+0200) Africa/Harare, Tuesday 22 May 2007
Subject: Re: parsing html data

On Tuesday 22 May 2007 22:00, David Moreno Garza wrote:
> xavier mas wrote:
>> dear all, I'm trying to query an online dictionary from an html document; the result of the query has to be output in a text field of the same html document. I understand this can be done using (perl) cgi scripts but am not sure which module I need for that.
>
> A number of modules around LWP can help you with it. WWW::Mechanize, specifically, can help you with its specific method, content():
>
>     $mech->content(format => 'text');
>
> -- David Moreno Garza [EMAIL PROTECTED] | http://www.damog.net/ URL:http://pub.tsn.dk/how-to-quote.php

Many thanks, David.
-- Xavier Mas
Re: Parsing HTML
Jenda Krynicky said:
> From: Scott Taylor [EMAIL PROTECTED]
>> I'm probably reinventing the wheel here, but I tried to get along with HTML::Parser and just couldn't get it to do anything. Too confusing, I think. I simply want to get a list of real words from an HTML string, minus all the HTML stuff. For example:
>> [snipped]
>
>     use HTML::JFilter qw(StripHTML); # http://www24.brinkster.com/jenda/#HTML::JFilter
>     $plain_text = StripHTML($text_with_html);

Nice thought, but the link leads nowhere. Cute comic though. Thanks.

-- Scott
Re: Parsing HTML
Jenda Krynicky said:
> From: Scott Taylor [EMAIL PROTECTED]
>> I'm probably reinventing the wheel here, but I tried to get along with HTML::Parser and just couldn't get it to do anything. Too confusing, I think. I simply want to get a list of real words from an HTML string, minus all the HTML stuff. For example:
>> [snipped]
>
>     use HTML::JFilter qw(StripHTML); # http://www24.brinkster.com/jenda/#HTML::JFilter
>     $plain_text = StripHTML($text_with_html);

Oops, sorry, there it is, just really slow. :)
RE: Parsing HTML
Scott Taylor mailto:[EMAIL PROTECTED] wrote:
: Is there a better, maybe more elegant, way to do this? I don't
: mind to use HTML::Parser if I could only figure out how.

    use HTML::TokeParser;

    my $html = q(
    This is a line of HTML:people write strange things here<br>
    and hardly ever follow proper<p> syntax
    A&amp;B suck at spelling as well<br>
    So I need to clean it up and strip out all<br>
    words less then 3 characters in length.<p>
    Later the words will go into an indexer for<br>
    searching a database
    );

    my $p = HTML::TokeParser->new( \$html );

    while (my $token = $p->get_token) {
        # text up to the next tag, with whitespace collapsed
        my $string = $p->get_trimmed_text;
        # start a new line after <br> and <p>
        $string = "\n$string" if $token->[1] eq 'br';
        $string = "\n$string" if $token->[1] eq 'p';
        print $string;
    }

    __END__

HTH,
Charles K. Clarkson
-- Mobile Homes Specialist 254 968-8328
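The stated goal in this thread (a list of words from an HTML string, dropping words shorter than 3 characters) can also be sketched by walking only the text tokens. This is one possible reading of the requirement, not the poster's code: the sub name index_words and the definition of a word as a run of `\w` characters are assumptions.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: return the words of 3 or more characters found
# in the text parts of an HTML string.
sub index_words {
    my ($html) = @_;
    my $p = HTML::TokeParser->new( \$html );
    my @words;
    while ( my $token = $p->get_token ) {
        next unless $token->[0] eq 'T';              # text tokens only
        my $text = decode_entities( $token->[1] );   # &amp; becomes &
        push @words, grep { length($_) >= 3 } $text =~ /\w+/g;
    }
    return @words;
}

print join( ' ', index_words('So I need to <b>clean</b> it up A&amp;B<br>and strip') ), "\n";
```

One limitation worth knowing: a word interrupted by a tag, such as `str<b>ange</b>`, is seen as two separate pieces by this approach.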
Re: Parsing HTML Data
On Sun, 19 Jun 2005, santosh kumar wrote:
> [Please] help me [with] the perl code reg ex for achieving the following : ( Please note that we should not use any CPAN modules like HTML::Parser etc )

Please show what you've tried so far.

Please explain why you can't use a CPAN module for this.

A parser is absolutely the right way to solve this problem. A regex is absolutely the wrong way to solve this problem. And reinventing that which has already been done 1000 times before is absolutely, positively, no doubt at all not a good use of your time :-)

But, if you're absolutely set on doing this the way you ask, you *must* clarify *exactly* what you want to extract, and you *must* explain what code you've tried already. Only then can we help sort things out.

This list is not a free script writing service. Sorry for any confusion.

-- Chris Devers
Re: Parsing HTML Data
"Santosh" == Santosh Kumar [EMAIL PROTECTED] writes:

Santosh> All,
Santosh> Pls help me the perl code reg ex for acheiving the following :
Santosh> ( Please note that we should not use any CPAN modules like HTML::Parser etc )
Santosh> file.html
Santosh> html
Santosh> UserNameinput...santdhdkgfkdfg/td
Santosh> Passwordinput...h1hh1/h1
Santosh> /html
Santosh> Required Output is :
Santosh> UserName
Santosh> Password

Homework?

-- Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095 merlyn@stonehenge.com URL:http://www.stonehenge.com/merlyn/ Perl/Unix/security consulting, Technical writing, Comedy, etc. etc. See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
RE: Parsing HTML Tables
Sri Pen mailto:[EMAIL PROTECTED] wrote:
: and second @tableDefRows match the data between
: <TD>*..4321191..</TD>. It matches all the data between
: <TABLE><TBODY><TR>*...</TR></TBODY></TABLE> something is wrong
: here.

You are getting this because Perl regular expressions are greedy by default and because regular expressions alone are a poor substitute for a good parser of HTML.

: Do I need to some how start from TABLE and my match all the
: way to first </td> and use some backtracking or something?

Perhaps. It is probably best to just scrap the regular expressions and use a module made for parsing HTML. Here's some code using a general HTML parser, but there are many table parsers as well.

    use HTML::TokeParser;

    my $parser = HTML::TokeParser->new( \$html_string );

    my @rows;
    while ( $parser->get_tag( 'tr' ) ) {

        # check error number
        next unless $parser->get_trimmed_text( '/td' ) eq '';

        # get next cell
        $parser->get_tag( 'td' );

        # check userid
        next unless $parser->get_trimmed_text( '/td' ) eq 'YHIRA';

        my @cells;
        while ( $parser->get_tag( 'td' ) ) {
            my $text = $parser->get_trimmed_text( '/td' );
            push @cells, $text if $text and $text =~ /4321191/;
        }

        # the archive masked this expression; a reference to @cells is assumed
        push @rows, \@cells if @cells;
    }

HTH,
Charles K. Clarkson
-- Mobile Homes Specialist 254 968-8328
RE: Parsing HTML Tables
Charles K. Clarkson mailto:[EMAIL PROTECTED] wrote:

Oopsie!

: my @rows;
: while ( $parser->get_tag( 'tr' ) ) {
:
:     # check error number
:     next unless $parser->get_trimmed_text( '/td' ) eq '';

That line should read:

    next unless $parser->get_trimmed_text( '/td' ) eq '4321191';

:
:     # get next cell
:     $parser->get_tag( 'td' );
:
:     # check userid
:     next unless $parser->get_trimmed_text( '/td' ) eq 'YHIRA';
:
:     my @cells;
:     while ( $parser->get_tag( 'td' ) ) {
:         my $text = $parser->get_trimmed_text( '/td' );
:         push @cells, $text if $text and $text =~ /4321191/;
:     }
:
:     # the archive masked this expression; a reference to @cells is assumed
:     push @rows, \@cells if @cells;
: }
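The "greedy by default" remark earlier in this thread can be seen in a few lines. The sample row below is made up around the two values mentioned in the thread (4321191 and YHIRA); the point is only the difference between `.*` and `.*?`.

```perl
use strict;
use warnings;

# Hypothetical row built from the values mentioned in the thread.
my $row = '<TR><TD>4321191</TD><TD>YHIRA</TD></TR>';

# Greedy: .* runs on to the LAST </TD>, swallowing the tags in between.
my ($greedy) = $row =~ m{<TD>(.*)</TD>};

# Non-greedy: .*? stops at the FIRST </TD>.
my ($lazy) = $row =~ m{<TD>(.*?)</TD>};

print "greedy: $greedy\n";    # greedy: 4321191</TD><TD>YHIRA
print "lazy:   $lazy\n";      # lazy:   4321191
```

The non-greedy form fixes this particular symptom, but it still breaks on nested tables, attributes on the tags, or line breaks inside cells, which is why the thread steers toward a parser.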
Re: parsing HTML
Andrew Gaffney wrote:
> Here is my current working code. Please take a look at it and see if there are any obvious (or not so obvious) problems. I thought this would end up being far more difficult.
> [snip code]

Looks good to me. Once you get used to the idea of event based parsing, storing context information on a stack, it's really simple, and even fun. Another nice thing is once you've mastered one (HTML::Parser), you've mastered them all (Pod::Parser, XML::Parser, etc.).

Regards,
Randy.
tracking where I am in a tree structure (was: Re: parsing HTML)
Andrew Gaffney wrote:
> Andrew Gaffney wrote:
>> Randy W. Sims wrote:
>>> On 7/21/2004 11:24 PM, Andrew Gaffney wrote:
>>>> Randy W. Sims wrote:
>>>>> On 7/21/2004 10:42 PM, Andrew Gaffney wrote:
>>>>>> I am trying to build a HTML editor for use with my HTML::Mason site. I intend for it to support nested tables, SPANs, and anchors. I am looking for a module that can help me parse existing HTML (custom or generated by my scripts) into a tree structure similar to:
>>>>>>
>>>>>>     my $html = [ { tag => 'table', id => 'maintable', width => 300, content => [
>>>>>>       { tag => 'tr', content => [
>>>>>>         { tag => 'td', width => 200, content => "some content" },
>>>>>>         { tag => 'td', width => 100, content => "more content" } ] } ] }
>>>>>>     ]; # Not tested, but you get the idea
>>>>>
>>>>> [snip]
>>>>
>>>> I'd rather generate a structure similar to what I have above instead of having a large tree of class objects that takes up more RAM and is probably slower. How would I go about generating a structure such as that above using HTML::Parser?
>>>
>>> Parsers like HTML::Parser scan a document and upon encountering certain tokens fire off events. In the case of HTML::Parser, events are fired when encountering a start tag, the text between tags, and at the end tag. If you have an arbitrarily deep document structure like HTML, you can store the structure using a stack:
>>> SNIP
>>
>> Thanks. In the time it took you to put that together, I came up with the following to figure out how HTML::Parser works. I'll use your code to expand upon it.
>> SNIP
>
> Here is my current working code. Please take a look at it and see if there are any obvious (or not so obvious) problems. I thought this would end up being far more difficult.
parsehtml.pl

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::Parser ();

    my $htmltree = [ { tag => 'document', content => [] } ];
    my $node = $htmltree->[0]->{content};   # content array currently being filled
    my @prevnodes = ($htmltree);            # stack of enclosing content arrays

    sub start {
        my $tagname = shift;
        my $attr = shift;
        my $newnode = {};
        $newnode->{tag} = $tagname;
        foreach my $key (keys %{$attr}) {
            $newnode->{$key} = $attr->{$key};
        }
        $newnode->{content} = [];
        push @prevnodes, $node;
        push @{$node}, $newnode;
        $node = $newnode->{content};
    }

    sub end {
        my $tagname = shift;
        $node = pop @prevnodes;
    }

    sub text {
        my $text = shift;
        chomp $text;
        if ($text ne '') {
            push @{$node}, $text;
        }
    }

    my $p = HTML::Parser->new( api_version => 3,
                               start_h => [\&start, "tagname, attr"],
                               end_h   => [\&end, "tagname"],
                               text_h  => [\&text, "dtext"] );
    $p->parse_file("test.html");

    use Data::Dumper;
    print Dumper $htmltree;

test.html =

    <table id="maintable" width="300">
      <tr>
        <td width="200">some content</td>
        <td width="100">more content</td>
      </tr>
    </table>

Now for the next challenge. I need to be able to know where I am in the tree structure for any node that I am in while I am walking it. I will pass along a value via CGI in the form of '0.0.2.1.2' which another script will translate as '$htmltree->[0]->{content}->[0]->{content}->[2]->{content}->[1]->{content}->[2]'. Using the above code, and the following code I wrote for walking the tree and generating HTML from it, how can I mark each outputted HTML tag with its position in the tree?
    sub descend_htmltree {
        my $node = shift;
        my $withclickiness = shift || 0;
        foreach my $tmpnode (@{$node}) {
            if (ref($tmpnode) eq 'HASH') {
                my $nodeid = ''; # Magic code to generate node's position in tree
                $htmloutput .= "<div style='border: thin solid #bb' onDblClick=\"alert('you clicked $nodeid')\">" if ($withclickiness);
                $htmloutput .= "<$tmpnode->{tag}";
                foreach (keys %{$tmpnode}) {
                    $htmloutput .= " $_=\"$tmpnode->{$_}\"" if ($_ ne 'tag' && $_ ne 'content');
                }
                $htmloutput .= ">";
                # pass the flag down so nested tags are wrapped as well
                descend_htmltree($tmpnode->{content}, $withclickiness);
                $htmloutput .= "</$tmpnode->{tag}>";
                $htmloutput .= "</div>" if ($withclickiness);
            } else {
                $htmloutput .= $tmpnode;
            }
        }
    }

    sub htmltree_to_html {
        my $filename = shift || '';
        my $withclickiness = shift || 0;
        descend_htmltree($htmltree->[0]->{content}, $withclickiness);
        if ($filename ne '') {
            open HTML, ">$filename" or die "Can't open $filename for HTML output";
            print HTML $htmloutput;
            close HTML;
        }
        return $htmloutput;
    }

-- Andrew Gaffney Network Administrator Skyline Aeronautics, LLC. 636-357-1548
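One way to answer the position question is to carry the dotted path down through the recursion, so each node's id is its parent's id plus its index in the parent's content array. The sketch below is an illustration under assumptions, not a reply from the thread: the sub names walk_with_paths and node_at are made up, and the sample tree simply mirrors the shape of $htmltree from the message above.

```perl
use strict;
use warnings;

# Hypothetical tree in the same shape as $htmltree above.
my $htmltree = [
    {   tag     => 'document',
        content => [
            {   tag     => 'table',
                id      => 'maintable',
                content => [
                    {   tag     => 'tr',
                        content => [
                            { tag => 'td', width => 200, content => ['some content'] },
                            { tag => 'td', width => 100, content => ['more content'] },
                        ],
                    },
                ],
            },
        ],
    },
];

# Walk a content array, handing each element node its dotted position.
sub walk_with_paths {
    my ( $nodes, $prefix, $visit ) = @_;
    for my $i ( 0 .. $#{$nodes} ) {
        my $child = $nodes->[$i];
        next unless ref $child eq 'HASH';    # plain text has no id
        my $path = "$prefix.$i";
        $visit->( $path, $child );
        walk_with_paths( $child->{content}, $path, $visit );
    }
}

# Turn a dotted position back into the node it names.
sub node_at {
    my ( $tree, $path ) = @_;
    my ( $first, @rest ) = split /\./, $path;
    my $node = $tree->[$first];
    for my $i (@rest) {
        return unless ref $node eq 'HASH';
        $node = $node->{content}[$i];
    }
    return $node;
}

my @seen;
walk_with_paths( $htmltree->[0]{content}, '0',
    sub { push @seen, "$_[0] $_[1]{tag}" } );
print "$_\n" for @seen;
```

In descend_htmltree the same idea would mean adding a path argument and looping over indices instead of elements, so that $nodeid can be set to the current path. Text nodes still occupy an index, which keeps the ids consistent with the '0.0.2.1.2' translation described in the message.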
Re: parsing HTML
On 7/21/2004 10:42 PM, Andrew Gaffney wrote:
> I am trying to build a HTML editor for use with my HTML::Mason site. I intend for it to support nested tables, SPANs, and anchors. I am looking for a module that can help me parse existing HTML (custom or generated by my scripts) into a tree structure similar to:
>
>     my $html = [ { tag => 'table', id => 'maintable', width => 300, content => [
>       { tag => 'tr', content => [
>         { tag => 'td', width => 200, content => "some content" },
>         { tag => 'td', width => 100, content => "more content" } ] } ] }
>     ]; # Not tested, but you get the idea
>
> which would correspond to the following HTML:
>
>     <table id="maintable" width="300">
>       <tr>
>         <td width="200">some content</td>
>         <td width="100">more content</td>
>       </tr>
>     </table>
>
> Once I have the data in the tree, I can easily modify it and transform it back into HTML. Is there a module that can help make this easier or should I go about this differently?

HTML::Parser doesn't build a tree, but you can use it to build one if necessary. However, you might find building a tree is not necessary. And this is less memory intensive. Then there is HTML::Tree.

Regards,
Randy.
Re: parsing HTML
Randy W. Sims wrote:
> On 7/21/2004 10:42 PM, Andrew Gaffney wrote:
>> I am trying to build a HTML editor for use with my HTML::Mason site. I intend for it to support nested tables, SPANs, and anchors. I am looking for a module that can help me parse existing HTML (custom or generated by my scripts) into a tree structure similar to:
>>
>>     my $html = [ { tag => 'table', id => 'maintable', width => 300, content => [
>>       { tag => 'tr', content => [
>>         { tag => 'td', width => 200, content => "some content" },
>>         { tag => 'td', width => 100, content => "more content" } ] } ] }
>>     ]; # Not tested, but you get the idea
>>
>> which would correspond to the following HTML:
>>
>>     <table id="maintable" width="300">
>>       <tr>
>>         <td width="200">some content</td>
>>         <td width="100">more content</td>
>>       </tr>
>>     </table>
>>
>> Once I have the data in the tree, I can easily modify it and transform it back into HTML. Is there a module that can help make this easier or should I go about this differently?
>
> HTML::Parser doesn't build a tree, but you can use it to build one if necessary. However, you might find building a tree is not necessary. And this is less memory intensive. Then there is HTML::Tree.

I'd rather generate a structure similar to what I have above instead of having a large tree of class objects that takes up more RAM and is probably slower. How would I go about generating a structure such as that above using HTML::Parser?

-- Andrew Gaffney Network Administrator Skyline Aeronautics, LLC. 636-357-1548
Re: parsing HTML
On 7/21/2004 11:24 PM, Andrew Gaffney wrote:
> Randy W. Sims wrote:
>> On 7/21/2004 10:42 PM, Andrew Gaffney wrote:
>>> I am trying to build a HTML editor for use with my HTML::Mason site. I intend for it to support nested tables, SPANs, and anchors. I am looking for a module that can help me parse existing HTML (custom or generated by my scripts) into a tree structure similar to:
>>>
>>>     my $html = [ { tag => 'table', id => 'maintable', width => 300, content => [
>>>       { tag => 'tr', content => [
>>>         { tag => 'td', width => 200, content => "some content" },
>>>         { tag => 'td', width => 100, content => "more content" } ] } ] }
>>>     ]; # Not tested, but you get the idea
>>
>> [snip]
>
> I'd rather generate a structure similar to what I have above instead of having a large tree of class objects that takes up more RAM and is probably slower. How would I go about generating a structure such as that above using HTML::Parser?

Parsers like HTML::Parser scan a document and upon encountering certain tokens fire off events. In the case of HTML::Parser, events are fired when encountering a start tag, the text between tags, and at the end tag. If you have an arbitrarily deep document structure like HTML, you can store the structure using a stack:

    #!/usr/bin/perl

    package SampleParser;

    use strict;

    use HTML::Parser;
    use base qw(HTML::Parser);

    sub start {
        my($self, $tagname, $attr, $attrseq, $origtext) = @_;
        my $stack = $self->{_stack};
        my $depth = $stack ? @$stack : 0;    # indent by nesting depth
        print ' ' x $depth, "<$tagname>\n";
        push @{$self->{_stack}}, ' ';
    }

    sub end {
        my($self, $tagname, $origtext) = @_;
        pop @{$self->{_stack}};
        my $stack = $self->{_stack};
        my $depth = $stack ? @$stack : 0;
        print ' ' x $depth, "</$tagname>\n";
    }

    1;

    package main;

    use strict;
    use warnings;

    my $p = SampleParser->new();
    $p->parse_file(\*DATA);

    __DATA__
    <html>
    <head>
    <title>Title</title>
    <body>
    The body.
    </body>
    </html>
Re: parsing HTML
Randy W. Sims wrote:
> On 7/21/2004 11:24 PM, Andrew Gaffney wrote:
>> Randy W. Sims wrote:
>>> On 7/21/2004 10:42 PM, Andrew Gaffney wrote:
>>>> I am trying to build a HTML editor for use with my HTML::Mason site. I intend for it to support nested tables, SPANs, and anchors. I am looking for a module that can help me parse existing HTML (custom or generated by my scripts) into a tree structure similar to:
>>>>
>>>>     my $html = [ { tag => 'table', id => 'maintable', width => 300, content => [
>>>>       { tag => 'tr', content => [
>>>>         { tag => 'td', width => 200, content => "some content" },
>>>>         { tag => 'td', width => 100, content => "more content" } ] } ] }
>>>>     ]; # Not tested, but you get the idea
>>>
>>> [snip]
>>
>> I'd rather generate a structure similar to what I have above instead of having a large tree of class objects that takes up more RAM and is probably slower. How would I go about generating a structure such as that above using HTML::Parser?
>
> Parsers like HTML::Parser scan a document and upon encountering certain tokens fire off events. In the case of HTML::Parser, events are fired when encountering a start tag, the text between tags, and at the end tag. If you have an arbitrarily deep document structure like HTML, you can store the structure using a stack:
>
>     #!/usr/bin/perl
>     package SampleParser;
>     use strict;
>     use HTML::Parser;
>     use base qw(HTML::Parser);
>
>     sub start {
>         my($self, $tagname, $attr, $attrseq, $origtext) = @_;
>         my $stack = $self->{_stack};
>         my $depth = $stack ? @$stack : 0;
>         print ' ' x $depth, "<$tagname>\n";
>         push @{$self->{_stack}}, ' ';
>     }
>
>     sub end {
>         my($self, $tagname, $origtext) = @_;
>         pop @{$self->{_stack}};
>         my $stack = $self->{_stack};
>         my $depth = $stack ? @$stack : 0;
>         print ' ' x $depth, "</$tagname>\n";
>     }
>
>     1;
>
>     package main;
>     use strict;
>     use warnings;
>
>     my $p = SampleParser->new();
>     $p->parse_file(\*DATA);
>
>     __DATA__
>     <html>
>     <head>
>     <title>Title</title>
>     <body>
>     The body.
>     </body>
>     </html>

Thanks. In the time it took you to put that together, I came up with the following to figure out how HTML::Parser works. I'll use your code to expand upon it.
    #!/usr/bin/perl

    use strict;
    use warnings;

    use HTML::Parser ();

    sub start {
        print "start ";
        foreach my $arg (@_) {
            if(ref($arg) eq 'HASH') {
                foreach my $key (keys %{$arg}) {
                    print "$key - $arg->{$key}\n";
                }
            } else {
                print "$arg\n";
            }
        }
    }

    sub end {
        print "end ";
        foreach (@_) {
            print "$_\n";
        }
    }

    sub text {
        my $text = shift;
        chomp $text;
        print "text - '$text'\n" if($text ne '');
    }

    my $p = HTML::Parser->new( api_version => 3,
                               start_h => [\&start, "tagname, attr"],
                               end_h   => [\&end,   "tagname"],
                               text_h  => [\&text,  "dtext"],
                               marked_sections => 1 ); # Not sure what this does
    $p->parse_file("test.html");

The above gives me the expected output for the sample HTML I provided before.

-- 
Andrew Gaffney
Network Administrator
Skyline Aeronautics, LLC.
636-357-1548
Re: parsing HTML
Andrew Gaffney wrote:
> Randy W. Sims wrote:
>> Parsers like HTML::Parser scan a document and upon encountering
>> certain tokens fire off events. In the case of HTML::Parser, events
>> are fired when encountering a start tag, the text between tags, and
>> at the end tag. If you have an arbitrarily deep document structure
>> like HTML, you can store the structure using a stack:
>>
>> SNIP
>
> Thanks. In the time it took you to put that together, I came up with
> the following to figure out how HTML::Parser works. I'll use your
> code to expand upon it.
>
> SNIP

Here is my current working code. Please take a look at it and see if there are any obvious (or not so obvious) problems. I thought this would end up being far more difficult.
parsehtml.pl:

    #!/usr/bin/perl

    use strict;
    use warnings;

    use HTML::Parser ();

    my $htmltree = [ { tag => 'document', content => [] } ];
    my $node = $htmltree->[0]->{content};
    my @prevnodes = ($htmltree);

    sub start {
        my $tagname = shift;
        my $attr = shift;
        my $newnode = {};

        $newnode->{tag} = $tagname;
        foreach my $key (keys %{$attr}) {
            $newnode->{$key} = $attr->{$key};
        }
        $newnode->{content} = [];
        push @prevnodes, $node;
        push @{$node}, $newnode;
        $node = $newnode->{content};
    }

    sub end {
        my $tagname = shift;
        $node = pop @prevnodes;
    }

    sub text {
        my $text = shift;
        chomp $text;
        if($text ne '') {
            push @{$node}, $text;
        }
    }

    my $p = HTML::Parser->new( api_version => 3,
                               start_h => [\&start, "tagname, attr"],
                               end_h   => [\&end,   "tagname"],
                               text_h  => [\&text,  "dtext"] );
    $p->parse_file("test.html");

    use Data::Dumper;
    print Dumper $htmltree;

test.html:

    <table id="maintable" width="300">
    <tr>
    <td width="200">some content</td>
    <td width="100">more content</td>
    </tr>
    </table>

-- 
Andrew Gaffney
Network Administrator
Skyline Aeronautics, LLC.
636-357-1548
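A few things in the code above may bite on less tidy input: an attribute named tag or content collides with the node's own keys, chomp only strips a single trailing newline so indented markup leaves whitespace-only text nodes, and every end tag pops the stack even when nothing matching is open (or when an element such as br was never closed). The following is only a sketch of one way to guard against those cases, not a definitive rewrite; the name html_to_tree, the separate attr key, and the short list of void elements are assumptions chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::Parser ();

# Illustrative subset of elements that never have an end tag.
my %VOID = map { $_ => 1 } qw(br hr img input meta link);

# Hypothetical helper: parse an HTML string into nested
# { tag => ..., attr => {...}, content => [...] } hashes.
sub html_to_tree {
    my ($html) = @_;

    my $root  = { tag => 'document', attr => {}, content => [] };
    my @stack = ($root);            # $stack[-1] is the innermost open element

    my $on_start = sub {
        my ($tagname, $attr) = @_;
        # Attributes live under their own key so they cannot clobber
        # 'tag' or 'content'.
        my $node = { tag => $tagname, attr => { %$attr }, content => [] };
        push @{ $stack[-1]{content} }, $node;
        push @stack, $node unless $VOID{$tagname};
    };

    my $on_end = sub {
        my ($tagname) = @_;
        # Close the nearest open element with this name, along with anything
        # left open inside it; ignore end tags that match nothing.
        for my $i (reverse 1 .. $#stack) {
            next unless $stack[$i]{tag} eq $tagname;
            splice @stack, $i;
            last;
        }
    };

    my $on_text = sub {
        my ($text) = @_;
        # Keep only text that contains something other than whitespace.
        push @{ $stack[-1]{content} }, $text if $text =~ /\S/;
    };

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ $on_start, 'tagname, attr' ],
        end_h       => [ $on_end,   'tagname' ],
        text_h      => [ $on_text,  'dtext' ],
    );
    $p->parse($html);
    $p->eof;

    return $root;
}

my $tree = html_to_tree(<<'HTML');
<table id="maintable" width="300">
  <tr>
    <td width="200">some content<br>line two</td>
    <td width="100">more content</td>
  </tr>
</table>
HTML

my $first_cell = $tree->{content}[0]{content}[0]{content}[0];
printf "%s (width %s) has %d children\n",
    $first_cell->{tag}, $first_cell->{attr}{width},
    scalar @{ $first_cell->{content} };
```

Keeping the stack inside the function also means the parser can be called more than once without leftover state from a previous document.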
Re: Parsing HTML Form Data in Perl
> I have written an e-mail script but cannot get the From part of the
> sendmail protocol to recognize e-mails with a period in the user name,
> like [EMAIL PROTECTED] It causes errors. I know if I send
> brook/.hurd@gm/.com it will work.

/. or \.?

> I cannot locate the code required to parse incoming e-mail formatted
> data to add / before each period. Anyone have any suggestions or a
> better idea on how to handle this?

    [localhost:~] tor% perl -e '$string = "test.test\@something.com"; $string =~ s/(\.)/\\$1/g; print $string;'
    test\.test@something\.com
    [localhost:~] tor%

Or

    [localhost:~] tor% perl -e '$array = "test.test\@something.com"; $array =~ s/(\.)/\/$1/g; print $array;'
    test/.test@something/.com
    [localhost:~] tor%

=~ s/(\.)/\/$1/g; seems to be what you are looking for.

Tor
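For use inside a script rather than on the command line, the backslash variant of the substitution can be wrapped in a small function. This is only a sketch: the name escape_periods is made up for illustration, and whether sendmail needs the periods escaped at all is not established in this thread. The built-in quotemeta is shown for comparison; it escapes every non-word character, including the @.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: put a backslash in front of every period.
sub escape_periods {
    my ($address) = @_;
    (my $escaped = $address) =~ s/\./\\./g;   # copy first, then substitute
    return $escaped;
}

# Single quotes keep Perl from interpolating @something as an array.
my $address = 'test.test@something.com';

print escape_periods($address), "\n";   # test\.test@something\.com
print quotemeta($address), "\n";        # test\.test\@something\.com
```

No capture group is needed here because the replacement is always the same literal text.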