Assigning to a list of variables from a regex

2006-11-11 Thread Nigel Peck


Hi all,

I'm trying to get the following to work and can't. It's the assignment 
to $val1 and $val2 that's causing me the problem.


my ( $val1, $val2 ) = $data =~ /^([^:]+):([^:]+)$/
|| die 'Failed to parse data';

What am I doing wrong? I can do it on multiple lines by assigning $1 and 
$2 to the variables after the regex but I want to do it all on one line.


TIA

Nigel

--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 




Re: Assigning to a list of variables from a regex

2006-11-11 Thread Sebastian Stumpf
On Sat, 11 Nov 2006 17:11:45 +
Nigel Peck <[EMAIL PROTECTED]> wrote:
> my ( $val1, $val2 ) = $data =~ /^([^:]+):([^:]+)$/
>   || die 'Failed to parse data';

Just use brackets around the regexp:
my ($val1, $val2) = ($data =~ //); 
die unless $val1 && $val2;

Greetings
Sebastian
-- 
VI VI VI - The editor of the beast.
- perlhacker.org

 




Re: Assigning to a list of variables from a regex

2006-11-11 Thread Rob Dixon

Nigel Peck wrote:
>
> Hi all,
>
> I'm trying to get the following to work and can't. It's the assignment
> to $val1 and $val2 that's causing me the problem.
>
> my ( $val1, $val2 ) = $data =~ /^([^:]+):([^:]+)$/
> || die 'Failed to parse data';
>
> What am I doing wrong? I can do it on multiple lines by assigning $1 and
> $2 to the variables after the regex but I want to do it all on one line.

Hi Nigel

The || operator has a higher priority than assignment, so you've written:

my ($val1, $val2) = ($data =~ /^([^:]+):([^:]+)$/ || die 'Failed to parse data');

This forces the regex match into scalar context, so it returns 1 if it succeeds,
which leaves $val1 equal to 1 and $val2 undefined.

Instead, use the low-precedence 'or' operator:

my ($val1, $val2) = $data =~ /^([^:]+):([^:]+)$/ or die 'Failed to parse data';

and all will be well.
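
The difference between the two operators can be seen in a small self-contained sketch. The sample string 'COUNT:0' and the variable names $a1, $a2, $b1, $b2 are chosen here for illustration only.

```perl
use strict;
use warnings;

my $data = 'COUNT:0';

# With '||' the match is the left operand of '||', so it runs in scalar
# context and yields 1 (true) instead of the captured groups.
my ( $a1, $a2 ) = $data =~ /^([^:]+):([^:]+)$/ || die 'Failed to parse data';
print 'a: ', $a1, ', ', ( defined $a2 ? $a2 : '(undef)' ), "\n";   # a: 1, (undef)

# With 'or' the list assignment is evaluated first, so the match runs in
# list context and returns the captures. 'or' then tests the number of
# elements the match produced, which is 2 here.
my ( $b1, $b2 ) = $data =~ /^([^:]+):([^:]+)$/ or die 'Failed to parse data';
print "b: $b1, $b2\n";                                              # b: COUNT, 0
```

Because 'or' tests the element count rather than the captured values, a second field of '0' does not trigger the die.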

HTH,

Rob

 




Re: Assigning to a list of variables from a regex

2006-11-11 Thread Dr.Ruud
Nigel Peck schreef:

> I'm trying to get the following to work and can't. It's the
> assignment to $val1 and $val2 that's causing me the problem.
>
> my ( $val1, $val2 ) = $data =~ /^([^:]+):([^:]+)$/
>   || die 'Failed to parse data';
>
> What am I doing wrong?

See perlop. Change the "||" to "or".


> I can do it on multiple lines by assigning $1
> and $2 to the variables after the regex but I want
> to do it all on one line.

I tend to put () around the =~ expression too:

  my ( $val1, $val2 ) = ($data =~ /^([^:]+):([^:]+)$/)
    or die 'Failed to parse data';

Is that second "[^:]+" really what you mean? Change it to ".*" if the
second value may be empty or contain colons itself.
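
As a rough illustration of that difference, here is a minimal sketch. The input 'time:12:30' is a made-up example, not data from the original post.

```perl
use strict;
use warnings;

my $data = 'time:12:30';   # hypothetical input whose value contains a colon

# Second field restricted to non-colon characters: no match, empty list.
my @strict = $data =~ /^([^:]+):([^:]+)$/;

# Second field as '.*': everything after the first colon is captured.
my @loose = $data =~ /^([^:]+):(.*)$/;

print scalar(@strict), " captures with [^:]+\n";
print "with .*: '$loose[0]' and '$loose[1]'\n";
```

Which pattern is right depends on whether a colon inside the value should count as a parse failure.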

-- 
Affijn, Ruud

"Gewoon is een tijger."


 




Re: Assigning to a list of variables from a regex

2006-11-11 Thread Rob Dixon

Rob Dixon wrote:
> Nigel Peck wrote:
>>
>> Hi all,
>>
>> I'm trying to get the following to work and can't. It's the assignment
>> to $val1 and $val2 that's causing me the problem.
>>
>> my ( $val1, $val2 ) = $data =~ /^([^:]+):([^:]+)$/
>> || die 'Failed to parse data';
>>
>> What am I doing wrong? I can do it on multiple lines by assigning $1 and
>> $2 to the variables after the regex but I want to do it all on one line.
>
> Hi Nigel
>
> The || operator has a higher priority than assignment, so you've written:
>
> my ($val1, $val2) = ($data =~ /^([^:]+):([^:]+)$/ || die 'Failed to parse data');
>
> This forces the regex match into scalar context, so it returns 1 if it succeeds,
> which leaves $val1 equal to 1 and $val2 undefined.
>
> Instead, use the low-precedence 'or' operator:
>
> my ($val1, $val2) = $data =~ /^([^:]+):([^:]+)$/ or die 'Failed to parse data';
>
> and all will be well.


By the way, also consider:

  my @data = $data =~ /[^:]+/g;
  die unless @data == 2;
  my ($val1, $val2) = @data;

or similar.
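
A small sketch of how that global-match variant behaves on a few inputs. The helper name parse_pair and the sample strings are hypothetical and only serve the illustration.

```perl
use strict;
use warnings;

# parse_pair is a hypothetical helper name, not something from the thread.
sub parse_pair {
    my ($data) = @_;
    my @fields = $data =~ /[^:]+/g;      # every run of non-colon characters
    return @fields == 2 ? @fields : ();  # empty list signals a parse failure
}

for my $input ( 'COUNT:0', 'a:b:c', 'a::b' ) {
    my @pair = parse_pair($input);
    print "$input => ", ( @pair ? "@pair" : 'rejected' ), "\n";
}
```

Note that 'a::b' is accepted here because the empty field between the colons produces no token, whereas the anchored pattern from the original question would reject it.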

Rob


 




Re: Assigning to a list of variables from a regex

2006-11-11 Thread Rob Dixon

Sebastian Stumpf wrote:
>
> On Sat, 11 Nov 2006 17:11:45 +
> Nigel Peck <[EMAIL PROTECTED]> wrote:
>>
>> my ( $val1, $val2 ) = $data =~ /^([^:]+):([^:]+)$/
>> || die 'Failed to parse data';
>
> Just use brackets around the regexp:
> my ($val1, $val2) = ($data =~ //);

That's the same as the OP wrote: =~ has a higher priority than =

> die unless $val1 && $val2;

Will die if the original string was something like 'COUNT:0'
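
The 'COUNT:0' case can be checked with a short sketch. Testing with defined is one possible alternative shown here for comparison; it is not something proposed in the thread.

```perl
use strict;
use warnings;

my ( $val1, $val2 ) = 'COUNT:0' =~ /^([^:]+):([^:]+)$/;

# The string '0' is a valid capture but is false in boolean context.
print "truth test rejects the input\n"   unless $val1 && $val2;

# A definedness test only fails when the match itself failed.
print "defined test accepts the input\n" if defined $val1 && defined $val2;
```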

Rob

 




Re: Assigning to a list of variables from a regex

2006-11-11 Thread Nigel Peck


Thanks Rob and Dr Ruud, exactly what I needed to know :)

Thanks for your input Sebastian :)

Cheers,
Nigel

 




Hi, how to extract five texts on each side of an URI? I post my own perl script and its use.

2006-11-11 Thread 辉 王
Hello everyone,

Recently, while implementing Chakrabarti's algorithm in Perl, I found it
difficult to extract the five pieces of text on each side of a URL (apart
from the anchor text). I got my program to work in the end, but it runs
slowly. Can anybody tell me how to improve its running speed? Thanks.

Below is my Perl module, named 'chakrabarti.pm'.

#!/usr/bin/perl
package chakrabarti;
require Exporter;
@ISA = qw/Exporter/;
@EXPORT = qw/extract_url_and_text/;
use warnings;
use strict;
use HTML::TreeBuilder;
use URI;
use constant WIDTH => 5;

# Module-level state filled in by process():
# all text pieces in document order, and [start, end, url] per anchor.
my @texts_arr = ();
my @anchor_index = ();

sub extract_url_and_text{
    my($html_ref, $base_ref) = @_;
    my %text_hash;
    my $tree = HTML::TreeBuilder->new_from_content(${$html_ref});
    my $body_tag = $tree->find_by_tag_name('body');
    # Fill @texts_arr and @anchor_index by walking the body.
    &process($body_tag);
    for (@anchor_index) {
        my ($start_index, $end_index, $url) = ($_->[0], $_->[1], $_->[2]);
        $url = URI->new_abs($url, ${$base_ref});
        my $text;
        # Up to WIDTH text pieces before the anchor, nearest first.
        for my $left_index (1..WIDTH) {
            last if $start_index < $left_index;
            $text .= $texts_arr[$start_index - $left_index] . ' ';
        }
        # The anchor's own text.
        $text .= join(" ", @texts_arr[$start_index..$end_index]) . ' ';
        # Up to WIDTH text pieces after the anchor.
        for my $right_index (1..WIDTH) {
            last if $end_index + $right_index > $#texts_arr;
            $text .= $texts_arr[$end_index + $right_index] . ' ';
        }
        $text_hash{$url} = $text;
    }
    $tree->delete;
    return [\%text_hash];
}

# Depth-first walk: collect text pieces and record the index range
# covered by each <a> element.
sub process {
    my $tag = shift;
    my ($start_index, $end_index, $url);
    if ($tag->tag eq 'a') {
        $start_index = @texts_arr;
        $url = $tag->attr('href');
    }
    foreach my $kid ($tag->content_list) {
        if (ref $kid) {
            &process($kid);
        } else {
            push @texts_arr, $kid;
        }
    }
    if ($tag->tag eq 'a') {
        $end_index = @texts_arr - 1;
        push @anchor_index, [$start_index, $end_index, $url];
    }
}
1;

Then, in my perl program, I can invoke this module. Below is a working
example:
   
use warnings;
use strict;
use LWP::UserAgent;
use chakrabarti;

my $ua = LWP::UserAgent->new;
my $res = $ua->get('http://www.cpan.org/');
if($res->is_success){
    my $url_text_ref = extract_url_and_text($res->content_ref, $res->base);
    for(keys %{$url_text_ref->[0]}){
        print $_, "\n", ${$url_text_ref->[0]}{$_}, "\n\n";
    }
}
   
Below is Chakrabarti's article:
http://www.cs.berkeley.edu/~soumen/doc/www2002m/p336-chakrabarti.pdf
   
Good luck!
   
Hui Wang



Re: Hi, how to extract five texts on each side of an URI? I post my own perl script and its use.

2006-11-11 Thread Robin Sheat
On Sunday 12 November 2006 13:17, 辉 王 wrote:
> I can make my program do its job at last, but it runs slowly.
> Can anybody tell me how to improve the running speed of this  
> program? Thanks.
Have you had a look with the Perl profiler to see which bits are slow?
That way you know which parts to look at to make them run faster. See
perldoc Devel::DProf for more information.
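
For a coarse first measurement without a profiler, individual steps can also be timed by hand. The sketch below uses the core Time::HiRes module, which is a different and much cruder technique than Devel::DProf; the helper name timed and the stand-in workload are hypothetical.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# timed() is a hypothetical helper: it runs a code reference, prints the
# elapsed wall-clock time under a label, and passes the result through.
sub timed {
    my ( $label, $code ) = @_;
    my $t0     = [gettimeofday];
    my @result = $code->();
    printf "%-10s %.6f s\n", $label, tv_interval($t0);
    return @result;
}

# Stand-in workload. In the posted script one might wrap the tree building
# and the text-window extraction in separate timed() calls.
my ($sum) = timed( 'sum loop', sub { my $s = 0; $s += $_ for 1 .. 100_000; $s } );
```

This only shows where wall-clock time goes at the granularity chosen by hand; a real profiler reports it per subroutine.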

-- 
Robin <[EMAIL PROTECTED]> JabberID: <[EMAIL PROTECTED]>

Hostes alienigeni me abduxerunt. Qui annus est?

PGP Key 0xA99CEB6D = 5957 6D23 8B16 EFAB FEF8  7175 14D3 6485 A99C EB6D




RE: Hi, how to extract five texts on each side of an URI? I post my own perl script and its use.

2006-11-11 Thread Charles K. Clarkson
Hui Wang  wrote:

: Can anybody tell me how to improve the running speed of this
: program? Thanks.

I don't know if this is faster, but it is a more accurate
solution. Your submitted code failed under some untested
circumstances. I created another page similar to the CPAN page you
used and fed it more complicated tests.

Chakrabarti placed relevance on distance from the link. I
changed your report to reflect this relevance. Instead of
squashing all text together, it now shows a report of text token
relevance. This change allowed me to test more thoroughly as well.
Here is the sample report for one link with multiple texts inside
the anchor.

http://www.clarksonenergyhomes.com/scripts/index.html
-5: 3401 MB 280 mirrors
-4: 5501 authors 10789 modules
-3: Welcome to CPAN! Here you will find All Things Perl.
-2: Browsing
-1: Perl modules
 0: Perl
 0: scripts
+1: Perl binary distributions ("ports")
+2: Perl source code
+3: Perl recent arrivals
+4: recent
+5: Perl modules
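
A report of this shape can be produced from a flat list of text tokens plus the index range of the anchor. The following is only a sketch under that assumption: context_window is a hypothetical helper, the token list is made up, and it is not the code behind the links below.

```perl
use strict;
use warnings;

# context_window is a hypothetical helper: given a flat list of text
# tokens and the first and last token index of an anchor, it returns
# [offset, text] pairs for up to $width tokens on each side.
sub context_window {
    my ( $tokens, $start, $end, $width ) = @_;
    my @rows;
    for my $i ( $start - $width .. $end + $width ) {
        next if $i < 0 || $i > $#$tokens;    # clip at the document edges
        my $offset = $i < $start ? $i - $start
                   : $i > $end   ? $i - $end
                   :               0;        # token is inside the anchor
        push @rows, [ $offset, $tokens->[$i] ];
    }
    return @rows;
}

my @tokens = ( 'Browsing', 'Perl modules', 'Perl', 'scripts', 'Perl source code' );

# Print signed offsets for context tokens and a plain 0 for anchor tokens.
for my $row ( context_window( \@tokens, 2, 3, 5 ) ) {
    my $label = $row->[0] ? sprintf( '%+d', $row->[0] ) : '0';
    printf "%2s: %s\n", $label, $row->[1];
}
```

Keeping the offset with each token, instead of concatenating everything into one string, is what lets distance from the link be used as a relevance signal later.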

You can find the modified code here (for a short time):

Script: http://www.clarksonenergyhomes.com/chakrabarti.txt
Module: http://www.clarksonenergyhomes.com/chakrabarti.pm


HTH,

Charles K. Clarkson
--
Mobile Homes Specialist
Free Market Advocate
Web Programmer

254 968-8328

http://www.clarksonenergyhomes.com/

Don't tread on my bandwidth. Trim your posts.


