Re: Hi, how to extract five texts on each side of an URI? I post my own perl script and its use.

2006-11-14 Thread Jenda Krynicky
From: 辉 王 [EMAIL PROTECTED]
>    my $text;
>    for my $left_index (1..WIDTH) {
>        last if $start_index < $left_index;
>        $text .= $texts_arr[$start_index - $left_index] . ' ';
>    }
>    $text .= join(" ", @texts_arr[$start_index..$end_index]) . ' ';
>    for my $right_index (1..WIDTH) {
>        last if $end_index + $right_index > $#texts_arr;
>        $text .= $texts_arr[$end_index + $right_index] . ' ';
>    }
>    $text_hash{$url} = $text;

As far as I can tell this could easily be rewritten with no loops. If
I understand it correctly, you want to get all the texts from
$start_index-WIDTH to $end_index+WIDTH, so something like:


my $left_index = $start_index - WIDTH;
$left_index = 0 if $left_index < 0;
my $right_index = $end_index + WIDTH;
$right_index = $#texts_arr if $right_index > $#texts_arr;

my $text = join(" ", @texts_arr[$left_index .. $right_index]);

should do what you are after. There are probably other things, but
this caught my eye.
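
To make the slice idea concrete, here is a minimal, self-contained sketch. The sub name context_window and the sample data are made up for illustration; only WIDTH and the clamping logic come from the snippet above. One difference worth knowing about: the original loops append the left-hand texts nearest-first (so in reverse document order) and leave a trailing space, whereas the slice returns everything in document order.

```perl
use strict;
use warnings;
use constant WIDTH => 5;

# Return the texts from WIDTH before $start to WIDTH after $end,
# clamped to the bounds of the array and joined with single spaces.
# (context_window is a hypothetical name, not part of the posted module.)
sub context_window {
    my ($texts, $start, $end) = @_;
    my $left = $start - WIDTH;
    $left = 0 if $left < 0;
    my $right = $end + WIDTH;
    $right = $#{$texts} if $right > $#{$texts};
    return join ' ', @{$texts}[$left .. $right];
}

# Hypothetical sample data: 't0' .. 't11'.
my @texts = map { "t$_" } 0 .. 11;
print context_window(\@texts, 6, 7), "\n";   # anchor covers t6..t7, clamped on the right
print context_window(\@texts, 1, 1), "\n";   # anchor is t1, clamped on the left
```

Passing the array by reference keeps the sub independent of the module-level @texts_arr, which also makes it easy to test on its own.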

Jenda
= [EMAIL PROTECTED] === http://Jenda.Krynicky.cz =
When it comes to wine, women and song, wizards are allowed
to get drunk and croon as much as they like.
-- Terry Pratchett in Sourcery


--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response




Hi, how to extract five texts on each side of an URI? I post my own perl script and its use.

2006-11-11 Thread 辉 王
Hello everyone,

Recently, while implementing Chakrabarti's algorithm in Perl, I found
it difficult to extract five texts on each side of a URL (excluding
the anchor text). My program does the job at last, but it runs slowly.
Can anybody tell me how to improve its running speed? Thanks.

Below is my own Perl module, named 'chakrabarti.pm'.

#!/usr/bin/perl
package chakrabarti;
require Exporter;
@ISA = qw/Exporter/;
@EXPORT = qw/extract_url_and_text/;
use warnings;
use strict;
use HTML::TreeBuilder;
use URI;
use constant WIDTH => 5;

# Filled in by process(): all text nodes in document order, and one
# [start_index, end_index, url] triple per anchor.
my @texts_arr = ();
my @anchor_index = ();

sub extract_url_and_text {
    my ($html_ref, $base_ref) = @_;
    my %text_hash;
    my $tree = HTML::TreeBuilder->new_from_content(${$html_ref});
    my $body_tag = $tree->find_by_tag_name('body');
    process($body_tag);
    for (@anchor_index) {
        my ($start_index, $end_index, $url) = ($_->[0], $_->[1], $_->[2]);
        $url = URI->new_abs($url, ${$base_ref});
        my $text;
        # Up to WIDTH texts to the left of the anchor, nearest first.
        for my $left_index (1..WIDTH) {
            last if $start_index < $left_index;
            $text .= $texts_arr[$start_index - $left_index] . ' ';
        }
        # The anchor's own texts.
        $text .= join(" ", @texts_arr[$start_index..$end_index]) . ' ';
        # Up to WIDTH texts to the right of the anchor.
        for my $right_index (1..WIDTH) {
            last if $end_index + $right_index > $#texts_arr;
            $text .= $texts_arr[$end_index + $right_index] . ' ';
        }
        $text_hash{$url} = $text;
    }
    $tree->delete;
    return [\%text_hash];
}

# Walk the tree depth-first, collecting text nodes and anchor positions.
sub process {
    my $tag = shift;
    my ($start_index, $end_index, $url);
    if ($tag->tag eq 'a') {
        $start_index = @texts_arr;
        $url = $tag->attr('href');
    }
    foreach my $kid ($tag->content_list) {
        if (ref $kid) {
            process($kid);
        } else {
            push @texts_arr, $kid;
        }
    }
    if ($tag->tag eq 'a') {
        $end_index = @texts_arr - 1;
        push @anchor_index, [$start_index, $end_index, $url];
    }
}
1;

Then, in my Perl program, I can invoke this module. Below is a working
example:

use warnings;
use strict;
use LWP::UserAgent;
use chakrabarti;

my $ua = LWP::UserAgent->new;
my $res = $ua->get('http://www.cpan.org/');
if ($res->is_success) {
    my $url_text_ref = extract_url_and_text($res->content_ref, $res->base);
    for (keys %{$url_text_ref->[0]}) {
        print $_, "\n", ${$url_text_ref->[0]}{$_}, "\n\n";
    }
}
   
Below is Chakrabarti's article:
http://www.cs.berkeley.edu/~soumen/doc/www2002m/p336-chakrabarti.pdf
   
Good luck!
   
Hui Wang



Re: Hi, how to extract five texts on each side of an URI? I post my own perl script and its use.

2006-11-11 Thread Robin Sheat
On Sunday 12 November 2006 13:17, 辉 王 wrote:
> I can make my program do its job at last, but it runs slowly.
> Can anybody tell me how to improve the running speed of this
> program? Thanks.
Have you had a look with the Perl profiler to see which bits are going slow?
That way you know what to look at to make them run faster. See perldoc
Devel::DProf for more information.
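
Devel::DProf is normally run as "perl -d:DProf yourscript.pl" followed by "dprofpp" to read the results. As a coarse complement to a real profiler, each phase can also be timed by hand. The sketch below assumes nothing about the posted module: the timed helper and the two stand-in phases are hypothetical, and in the real program the fetch, the tree build, and the extraction loop would be wrapped in the same way.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Run a code reference, report its wall-clock time on STDERR,
# and pass its result through. (timed is a hypothetical helper.)
sub timed {
    my ($label, $code) = @_;
    my $t0 = time();
    my @result = $code->();
    printf STDERR "%-8s %.4f s\n", $label, time() - $t0;
    return wantarray ? @result : $result[0];
}

# Hypothetical stand-ins for the real phases (fetch, parse, extract).
my @texts  = timed('build', sub { map { "text $_" } 1 .. 50_000 });
my $joined = timed('join',  sub { join ' ', @texts });
print length($joined), "\n";
```

If the fetch or the HTML parse dominates, rewriting the extraction loops will not change the total running time much.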

-- 
Robin [EMAIL PROTECTED] JabberID: [EMAIL PROTECTED]

Hostes alienigeni me abduxerunt. Qui annus est?

PGP Key 0xA99CEB6D = 5957 6D23 8B16 EFAB FEF8  7175 14D3 6485 A99C EB6D




RE: Hi, how to extract five texts on each side of an URI? I post my own perl script and its use.

2006-11-11 Thread Charles K. Clarkson
Hui Wang mailto:[EMAIL PROTECTED] wrote:

: Can anybody tell me how to improve the running speed of this
: program? Thanks.

I don't know if this is faster, but it is a more accurate
solution. Your submitted code failed under some untested
circumstances. I created another page similar to the CPAN page you
used and fed it more complicated tests.

Chakrabarti placed relevance on distance from the link. I
changed your report to reflect this relevance. Instead of
squashing all text together, it now shows a report of text token
relevance. This change allowed me to test more thoroughly as well.
Here is the sample report for one link with multiple texts inside
the anchor.

http://www.clarksonenergyhomes.com/scripts/index.html
-5: 3401 MB 280 mirrors
-4: 5501 authors 10789 modules
-3: Welcome to CPAN! Here you will find All Things Perl.
-2: Browsing
-1: Perl modules
 0: Perl
 0: scripts
+1: Perl binary distributions (ports)
+2: Perl source code
+3: Perl recent arrivals
+4: recent
+5: Perl modules
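
A report in this shape can be produced with a few lines of Perl. The following is only a sketch of the idea, not the code behind the links below: distance_report is a hypothetical name, and the sample texts are a short excerpt taken from the report above.

```perl
use strict;
use warnings;
use constant WIDTH => 5;

# Build one report line per text token around an anchor. Each token is
# labelled with its signed distance from the anchor; tokens inside the
# anchor get distance 0. (distance_report is a hypothetical name.)
sub distance_report {
    my ($texts, $start, $end) = @_;
    my @lines;
    for my $i ($start - WIDTH .. $end + WIDTH) {
        next if $i < 0 || $i > $#{$texts};       # skip positions outside the page
        my $dist = $i < $start ? $i - $start
                 : $i > $end   ? $i - $end
                 :               0;
        my $label = $dist ? sprintf('%+d', $dist) : ' 0';
        push @lines, "$label: $texts->[$i]";
    }
    return @lines;
}

# Sample texts; the anchor covers indices 2..3 ('Perl', 'scripts').
my @texts = ('Browsing', 'Perl modules', 'Perl', 'scripts',
             'Perl binary distributions (ports)');
print "$_\n" for distance_report(\@texts, 2, 3);
```

Keeping the distance with each token, instead of joining everything into one string, leaves room to weight nearer texts more heavily later.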

You can find the modified code here (for a short time):

Script: http://www.clarksonenergyhomes.com/chakrabarti.txt
Module: http://www.clarksonenergyhomes.com/chakrabarti.pm


HTH,

Charles K. Clarkson
--
Mobile Homes Specialist
Free Market Advocate
Web Programmer

254 968-8328

http://www.clarksonenergyhomes.com/

Don't tread on my bandwidth. Trim your posts.







Hi, how to extract five texts on each side of an URI? I post my own perl script and its use.

2006-11-09 Thread 辉 王
Hello, everyone,

Recently, while implementing Chakrabarti's algorithm in Perl, I found
it difficult to extract five texts on each side of a URL (excluding
the anchor text). My program does the job at last, but it runs slowly.
Can anybody tell me how to improve its running speed? Thanks.

Below is my own Perl module, named 'chakrabarti.pm'.
   
#!/usr/bin/perl
package chakrabarti;
require Exporter;
@ISA = qw/Exporter/;
@EXPORT = qw/extract_url_and_text/;
use warnings;
use strict;
use HTML::TreeBuilder;
use URI;
use constant WIDTH => 5;

my @texts_arr = ();
my @anchor_index = ();

sub extract_url_and_text {
    my ($html_ref, $base_ref) = @_;
    my %text_hash;
    my $tree = HTML::TreeBuilder->new_from_content(${$html_ref});
    my $body_tag = $tree->find_by_tag_name('body');
    process($body_tag);
    for (@anchor_index) {
        my ($start_index, $end_index, $url) = ($_->[0], $_->[1], $_->[2]);
        $url = URI->new_abs($url, ${$base_ref});
        my $text;
        for my $left_index (1..WIDTH) {
            last if $start_index < $left_index;
            $text .= $texts_arr[$start_index - $left_index] . ' ';
        }
        $text .= join(" ", @texts_arr[$start_index..$end_index]) . ' ';
        for my $right_index (1..WIDTH) {
            last if $end_index + $right_index > $#texts_arr;
            $text .= $texts_arr[$end_index + $right_index] . ' ';
        }
        $text_hash{$url} = $text;
    }
    $tree->delete;
    return [\%text_hash];
}

sub process {
    my $tag = shift;
    my ($start_index, $end_index, $url);
    if ($tag->tag eq 'a') {
        $start_index = @texts_arr;
        $url = $tag->attr('href');
    }
    foreach my $kid ($tag->content_list) {
        if (ref $kid) {
            process($kid);
        } else {
            push @texts_arr, $kid;
        }
    }
    if ($tag->tag eq 'a') {
        $end_index = @texts_arr - 1;
        push @anchor_index, [$start_index, $end_index, $url];
    }
}
1;

Then, in my Perl program, I can invoke this module. Below is a working
example:

use warnings;
use strict;
use LWP::UserAgent;
use chakrabarti;

my $ua = LWP::UserAgent->new;
my $res = $ua->get('http://www.cpan.org/');
if ($res->is_success) {
    my $url_text_ref = extract_url_and_text($res->content_ref, $res->base);
    for (keys %{$url_text_ref->[0]}) {
        print $_, "\n", ${$url_text_ref->[0]}{$_}, "\n\n";
    }
}
   
Below is Chakrabarti's article:
http://www.cs.berkeley.edu/~soumen/doc/www2002m/p336-chakrabarti.pdf
   
 Good luck!
   
 Hui Wang

   
  
 

