Re: Hi, how to extract five texts on each side of an URI? I post my own perl script and its use.
From: 辉 王 [EMAIL PROTECTED]

> my $text;
> for my $left_index (1..WIDTH) {
>     last if $start_index < $left_index;
>     $text .= $texts_arr[$start_index - $left_index] . ' ';
> }
> $text .= join(" ", @texts_arr[$start_index..$end_index]) . ' ';
> for my $right_index (1..WIDTH) {
>     last if $end_index + $right_index > $#texts_arr;
>     $text .= $texts_arr[$end_index + $right_index] . ' ';
> }
> $text_hash{$url} = $text;

As far as I can tell, this could easily be rewritten with no loops. If I understand it correctly, you want to get all the texts from $start_index-WIDTH to $end_index+WIDTH, so something like:

    my $left_index = $start_index - WIDTH;
    $left_index = 0 if $left_index < 0;
    my $right_index = $end_index + WIDTH;
    $right_index = $#texts_arr if $right_index > $#texts_arr;
    my $text = join(" ", @texts_arr[$left_index .. $right_index]);

should do what you are after. There are probably other things, but this caught my eye.

Jenda
= [EMAIL PROTECTED] === http://Jenda.Krynicky.cz =
When it comes to wine, women and song, wizards are allowed
to get drunk and croon as much as they like.
    -- Terry Pratchett in Sourcery

--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response
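A self-contained sketch of the slice approach described above, for anyone who wants to try it outside the module. The sub name `context_window` and the sample data are made up for illustration; only the clamping logic comes from the message. Note that a slice returns the left-hand texts in document order, whereas the quoted loop appends them nearest-first, so the two are not byte-for-byte equivalent.

```perl
use strict;
use warnings;

use constant WIDTH => 5;

# Hypothetical helper (not part of the posted module): returns the texts
# from WIDTH before $start_index to WIDTH after $end_index, joined with
# spaces and clamped to the bounds of the array.
sub context_window {
    my ($texts, $start_index, $end_index) = @_;

    my $left_index = $start_index - WIDTH;
    $left_index = 0 if $left_index < 0;

    my $right_index = $end_index + WIDTH;
    $right_index = $#{$texts} if $right_index > $#{$texts};

    return join(' ', @{$texts}[$left_index .. $right_index]);
}

# Made-up sample data: ten text chunks, with the anchor text at index 7.
my @texts_arr = map { "t$_" } 0 .. 9;
print context_window(\@texts_arr, 7, 7), "\n";   # prints "t2 t3 t4 t5 t6 t7 t8 t9"
```

Passing the array by reference keeps the helper free of the module-level `@texts_arr`, which makes it easier to test on its own.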
Hi, how to extract five texts on each side of an URI? I post my own perl script and its use.
Hello everyone,

Recently, while trying to implement Chakrabarti's algorithm in Perl, I found it difficult to extract five texts on each side of a URL (not counting the anchor text). I eventually got my program to do its job, but it runs slowly. Can anybody tell me how to improve the running speed of this program? Thanks.

Below is my own Perl module, named 'chakrabarti.pm'.

    #!/usr/bin/perl
    package chakrabarti;
    require Exporter;
    @ISA = qw/Exporter/;
    @EXPORT = qw/extract_url_and_text/;
    use warnings;
    use strict;
    use HTML::TreeBuilder;
    use URI;
    use constant WIDTH => 5;

    my @texts_arr = ();       # every text chunk found under <body>, in order
    my @anchor_index = ();    # [start_index, end_index, href] for each <a>

    sub extract_url_and_text{
        my($html_ref, $base_ref) = @_;
        my %text_hash;
        my $tree = HTML::TreeBuilder->new_from_content(${$html_ref});
        my $body_tag = $tree->find_by_tag_name('body');
        process($body_tag);
        for (@anchor_index) {
            my ($start_index, $end_index, $url) = ($_->[0], $_->[1], $_->[2]);
            $url = URI->new_abs($url, ${$base_ref});
            my $text;
            # up to WIDTH texts before the anchor
            for my $left_index (1..WIDTH) {
                last if $start_index < $left_index;
                $text .= $texts_arr[$start_index - $left_index] . ' ';
            }
            # the anchor's own texts
            $text .= join(" ", @texts_arr[$start_index..$end_index]) . ' ';
            # up to WIDTH texts after the anchor
            for my $right_index (1..WIDTH) {
                last if $end_index + $right_index > $#texts_arr;
                $text .= $texts_arr[$end_index + $right_index] . ' ';
            }
            $text_hash{$url} = $text;
        }
        $tree->delete;
        return [\%text_hash];
    }

    # Walk the tree depth-first, collecting text chunks and recording
    # which range of chunks belongs to each <a> element.
    sub process {
        my $tag = shift;
        my ($start_index, $end_index, $url);
        if ($tag->tag eq 'a') {
            $start_index = @texts_arr;
            $url = $tag->attr('href');
        }
        foreach my $kid ($tag->content_list) {
            if (ref $kid) {
                process($kid);
            } else {
                push @texts_arr, $kid;
            }
        }
        if ($tag->tag eq 'a') {
            $end_index = @texts_arr - 1;
            push @anchor_index, [$start_index, $end_index, $url];
        }
    }
    1;

Then, in my Perl program, I can invoke this module.
Below is a working example:

    use warnings;
    use strict;
    use LWP::UserAgent;
    use chakrabarti;

    my $ua = LWP::UserAgent->new;
    my $res = $ua->get('http://www.cpan.org/');
    if($res->is_success){
        my $url_text_ref = extract_url_and_text($res->content_ref, $res->base);
        for(keys %{$url_text_ref->[0]}){
            print $_, "\n", ${$url_text_ref->[0]}{$_}, "\n\n";
        }
    }

Below is Chakrabarti's article:
http://www.cs.berkeley.edu/~soumen/doc/www2002m/p336-chakrabarti.pdf

Good luck!
Hui Wang
Re: Hi, how to extract five texts on each side of an URI? I post my own perl script and its use.
On Sunday 12 November 2006 13:17, 辉 王 wrote:
> I can make my program do its job at last, but it runs slowly. Can
> anybody tell me how to improve the running speed of this program?
> Thanks.

Have you had a look with the Perl profiler to see which bits are running slowly? That way you know which parts to look at to make them run faster. See perldoc Devel::DProf for more information.

--
Robin [EMAIL PROTECTED] JabberID: [EMAIL PROTECTED]
Hostes alienigeni me abduxerunt. Qui annus est?
PGP Key 0xA99CEB6D = 5957 6D23 8B16 EFAB FEF8 7175 14D3 6485 A99C EB6D
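Once a profiler has pointed at a hot spot, one way to check whether a particular rewrite actually helps is to time both versions with the core Benchmark module. This is a different tool from Devel::DProf, and the sketch below is only an illustration, not a measurement from this thread: the subs `with_loops` and `with_slice`, the sample array, and the anchor position are all invented. It compares a loop-based context window with a single array slice.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

use constant WIDTH => 5;

my @texts_arr = map { "text$_" } 0 .. 99;    # made-up data
my ($start_index, $end_index) = (50, 51);    # made-up anchor position

# Hypothetical loop-based version: collect neighbours one at a time.
sub with_loops {
    my @parts;
    for my $i (reverse 1 .. WIDTH) {
        next if $start_index < $i;
        push @parts, $texts_arr[$start_index - $i];
    }
    push @parts, @texts_arr[$start_index .. $end_index];
    for my $i (1 .. WIDTH) {
        last if $end_index + $i > $#texts_arr;
        push @parts, $texts_arr[$end_index + $i];
    }
    return join ' ', @parts;
}

# Hypothetical slice-based version: clamp the bounds, then slice once.
sub with_slice {
    my $left = $start_index - WIDTH;
    $left = 0 if $left < 0;
    my $right = $end_index + WIDTH;
    $right = $#texts_arr if $right > $#texts_arr;
    return join ' ', @texts_arr[$left .. $right];
}

# Run each version 200_000 times and print a comparison table.
cmpthese(200_000, { loops => \&with_loops, slice => \&with_slice });
```

The numbers will vary by machine, and on a real page the time spent parsing the HTML may well dominate, which is why profiling the whole program first is still the better starting point.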
RE: Hi, how to extract five texts on each side of an URI? I post my own perl script and its use.
Hui Wang mailto:[EMAIL PROTECTED] wrote:

: Can anybody tell me how to improve the running speed of this
: program? Thanks.

I don't know if this is faster, but it is a more accurate solution. Your submitted code failed under some untested circumstances. I created another page similar to the CPAN page you used and fed it more complicated tests.

Chakrabarti placed relevance on distance from the link. I changed your report to reflect this relevance. Instead of squashing all text together, it now shows a report of text token relevance. This change allowed me to test more thoroughly as well.

Here is the sample report for one link with multiple texts inside the anchor.

http://www.clarksonenergyhomes.com/scripts/index.html
    -5: 3401 MB 280 mirrors
    -4: 5501 authors 10789 modules
    -3: Welcome to CPAN! Here you will find All Things Perl.
    -2: Browsing
    -1: Perl modules
     0: Perl
     0: scripts
    +1: Perl binary distributions (ports)
    +2: Perl source code
    +3: Perl recent arrivals
    +4: recent
    +5: Perl modules

You can find the modified code here (for a short time):

Script: http://www.clarksonenergyhomes.com/chakrabarti.txt
Module: http://www.clarksonenergyhomes.com/chakrabarti.pm

HTH,

Charles K. Clarkson
--
Mobile Homes Specialist
Free Market Advocate
Web Programmer
254 968-8328
http://www.clarksonenergyhomes.com/

Don't tread on my bandwidth. Trim your posts.
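The report format above can be sketched roughly as follows. This is not Charles's code (his script is only linked, not shown in the thread); the sub name `relevance_report` and the sample data are assumptions made for illustration. Each text within WIDTH positions of the anchor is labelled with its signed distance from the anchor, and the anchor's own texts are labelled 0.

```perl
use strict;
use warnings;

use constant WIDTH => 5;

# Hypothetical helper: returns one "distance: text" line per text chunk,
# from WIDTH chunks before the anchor to WIDTH chunks after it.
sub relevance_report {
    my ($texts, $start_index, $end_index) = @_;

    # Clamp the window to the bounds of the array.
    my $first = $start_index - WIDTH;
    $first = 0 if $first < 0;
    my $last = $end_index + WIDTH;
    $last = $#{$texts} if $last > $#{$texts};

    my @lines;
    for my $i ($first .. $last) {
        # Negative before the anchor, positive after it, 0 inside it.
        my $distance = $i < $start_index ? $i - $start_index
                     : $i > $end_index   ? $i - $end_index
                     :                     0;
        my $label = $distance ? sprintf('%+d', $distance) : '0';
        push @lines, "$label: $texts->[$i]";
    }
    return @lines;
}

# Made-up sample: the anchor covers the two chunks "Perl" and "scripts".
my @texts_arr = ('Browsing', 'Perl modules', 'Perl', 'scripts', 'Perl source code');
print "$_\n" for relevance_report(\@texts_arr, 2, 3);
```

Keeping each text on its own line, instead of joining everything into one string, preserves the distance information that the algorithm weights.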