Usage: HTML::Parser::parse(self, chunk)

2001-04-10 Thread Matej Kovacic

Hi!

I have written a Perl program which reads URLs from a file, does some simple
analysis of each page, and extracts all the links. The links are written to
the file results.net, and info about each URL to results.out.
I ran the program, and after analysing several thousand pages, I got this
error:
===
Uncaught exception from user code:
Usage: HTML::Parser::parse(self, chunk) at jure.pl line ---.

main::analyse('http://alibaba.ijs.si/ME/CD/docs/CES1/dtd2html/cesAna/distrib
uto...') called at jure.pl line 103

===
Here is the program:

#!/usr/bin/perl -w
use CGI;

use HTML::LinkExtor;
use LWP::Simple;
use HTTP::Response;
use LWP;
use LWP::UserAgent;
use URI::URL;
use strict;
use diagnostics;

# HTTP Response Codes on Has(c)h (RFC2068)
###
my %statuscode = (
100   => 'Continue',
101   => 'Switching Protocols',
200   => 'OK',
201   => 'Created',
202   => 'Accepted',
203   => 'Non-Authoritative Information',
204   => 'No Content',
205   => 'Reset Content',
206   => 'Partial Content',
300   => 'Multiple Choices',
301   => 'Moved Permanently',
302   => 'Moved Temporarily',
303   => 'See Other',
304   => 'Not Modified',
305   => 'Use Proxy',
400   => 'Bad Request',
401   => 'Unauthorized',
402   => 'Payment Required',
403   => 'Forbidden',
404   => 'Not Found',
405   => 'Method Not Allowed',
406   => 'Not Acceptable',
407   => 'Proxy Authentication Required',
408   => 'Request Time-out',
409   => 'Conflict',
410   => 'Gone',
411   => 'Length Required',
412   => 'Precondition Failed',
413   => 'Request Entity Too Large',
414   => 'Request-URI Too Large',
415   => 'Unsupported Media Type',
500   => 'Internal Server Error',
501   => 'Not Implemented',
502   => 'Bad Gateway',
503   => 'Service Unavailable',
504   => 'Gateway Time-out',
505   => 'HTTP Version not supported'
);

my $filesdir = "files";
$| = 1;

# print "Content-type: text/html\n\n";

BEGIN {
  open(OUT,"results/result.out") || print "Error!";
  open(NET,"results/result.net") || print "Error!";
}
END {
  close(NET);
  close(OUT);
}

  my $file = "si-url.txt";
  if ($file ne '') {

  print "Analysing file: $file";
  open(FILE,"$filesdir/$file") || print "Error - no file $file.";
  local $\ = undef;
  my @file = <FILE>;
  close(FILE);

  print "START:\n";

  my $count = 1;
  foreach my $main_line (@file) {
$/="\r\n";
chomp($main_line);
$/="\n";
my $base_url = "$main_line";
print "Line number: $count\n";
if ($count > 1665) {
  analyse($base_url);
}
$count++;
  }
  }



###
# Analyse page!
###
sub analyse {

  my $url=shift;

  chomp($url);

  my $browser = LWP::UserAgent->new();
  $browser->agent("MatejKovacicGregaPetric/InternetResearchProject");

  my $webdoc = $browser->request(HTTP::Request->new(GET => $url));

  my $responsecode = $webdoc->code;

  if ($webdoc->is_success) {

# COUNT images
@main::images=();
@main::images = $webdoc->content =~
m{
   <  \s*   img
 \s*   ["|']?  (  [^\s'"]+  )   ['|"]?
 }xgm;
$main::num_img=@main::images;

my $string = $webdoc->content;
$string =~ s/\n//g; # remove all newlines!
$string =~ s/\r//g; # remove all carriage returns!
$string =~ s/\t//g; # remove all tabs!

# LENGTH of page
my $size1 = length($string); # get the size of a string!

$string =~ s/<([^>]|\n)*>//g; # remove all HTML tags
$string =~ s/  //g; # remove all double spaces!

# LENGTH of text on the page
my $size2 = length($string); # get the size of a shortened string!

print OUT "$url  $responsecode  $size1  $size2  ",$webdoc->base,
"  ",$webdoc->content_type,"  ",$webdoc->title,
"  $main::num_img\n";

# EXTRACT links
$main::base_url = $webdoc->base;
my $parser = HTML::LinkExtor->new(undef, $main::base_url);
$parser->parse(get($main::base_url))->eof;
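
# (Side note, not part of the original run: the page is already in $webdoc
#  from the $browser->request() call above, so the links could be extracted
#  from that content directly instead of fetching the URL a second time
#  with LWP::Simple::get() -- which also avoids handing parse() an
#  undefined value when the second fetch fails.  A rough sketch:
#
#    my $parser = HTML::LinkExtor->new(undef, $webdoc->base);
#    $parser->parse($webdoc->content);
#    $parser->eof;
#    my @links = $parser->links;
# )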

@main::links = $parser->links;
my %seen;

foreach my $linkarray (@main::links) {
  @main::element=();
  @main::element  = @$linkarray;
  my $elt_type = shift @main::element;
  while (@main::element) {
my ($attr_name , $attr_value) = splice(@main::element, 0, 2);
$seen{$attr_value}++;
  }
}

@main::arr = sort keys %seen;

my $i = 0;
for (sort keys %seen) {
  print NET "$main::base_url - $main::arr[$i]\n";
  $i++;
}

  }
  else {
print OUT "$url  $responsecode  .  .  .  .  .  .\n";
  }
}


Do you have any idea how to solve this problem?

I should say I am quite new to Perl programming... but I am trying
hard.

Thanks in advance.

bye, Matej




Bug report: Memory leak with LWP::UserAgent/HTTP::Request under RH Linux 6.1/7

2001-04-10 Thread Curt Powell

Dear sir or madam,

I have encountered a memory leak under RedHat Linux (versions 6.1 and 7),
Perl 5.005 and 5.6.  It occurs with multiple calls to LWP::UserAgent and
HTTP::Request.  Following is a short script that demonstrates the problem.
On RH7 it shows memory deltas of 8k every 10 or so iterations after the
first iteration.  The amount leaked doesn't seem to be related to the size
of the page downloaded.  Is it possible that I am not doing the call
sequence correctly?

Regards,

Curt Powell

#!/usr/bin/perl
#usage: ./memtest url e.g. ./memtest http://www.sierraridge.com

sub geturl
{
use LWP::UserAgent;
use HTTP::Request;
my $URL = shift;
my $UA = LWP::UserAgent->new();
my $Request = HTTP::Request->new(GET => $URL);
my $Response = $UA->request($Request);
print "Error retrieving $URL\n" if ($Response->is_error());
return $Response->as_string;
}

sub memused
{
local *memused_TMP_FILE;
open(memused_TMP_FILE, "/proc/$$/stat");
my $a = <memused_TMP_FILE>;
close memused_TMP_FILE;
my @b = split(' ', $a);
return $b[22];   # field 23 of /proc/$$/stat: vsize (virtual memory size, in bytes)
}

$url = shift @ARGV;
$lastused = memused();
for ($i = 0; $i <= 100; ++$i)
{
$length = length(geturl($url));
$used = memused;
$delta = $used - $lastused;
print "$i: response length: $length memory used: $used memory change: $delta\n";
$lastused = $used;
}





Re: Usage: HTML::Parser::parse(self, chunk)

2001-04-10 Thread Gisle Aas

"Matej Kovacic" [EMAIL PROTECTED] writes:

 Uncaught exception from user code:
 Usage: HTML::Parser::parse(self, chunk) at jure.pl line ---.

[...]

 # HTTP Response Codes on Has(c)h (RFC2068)
 ###
 my %statuscode = (
   100   => 'Continue',
   101   => 'Switching Protocols',
   200   => 'OK',
   201   => 'Created',

BTW, this data is also available from the HTTP::Status module.
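
For example (a minimal sketch -- status_message() is exported by
HTTP::Status):

use HTTP::Status qw(status_message);

print status_message(404), "\n";   # "Not Found"
print status_message(301), "\n";   # "Moved Permanently"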

 my $parser = HTML::LinkExtor->new(undef, $main::base_url);
 $parser->parse(get($main::base_url))->eof;

The error probably shows up because get() returns nothing instead of
undef.  I think the following patch should fix the problem:

Index: lib/LWP/Simple.pm
===
RCS file: /cvsroot/libwww-perl/lwp5/lib/LWP/Simple.pm,v
retrieving revision 1.33
diff -u -p -u -r1.33 Simple.pm
--- lib/LWP/Simple.pm   2000/05/24 09:40:43 1.33
+++ lib/LWP/Simple.pm   2001/04/10 17:12:27
@@ -298,7 +298,7 @@ sub _trivial_http_get
my $sock = IO::Socket::INET->new(PeerAddr => $host,
 PeerPort => $port,
 Proto=> 'tcp',
-Timeout  => 60) || return;
+Timeout  => 60) || return undef;
$sock->autoflush;
my $netloc = $host;
$netloc .= ":$port" if $port != 80;
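
Until a fixed LWP::Simple is released, a workaround on the calling side is
to test what get() hands back before feeding it to the parser.  A minimal
sketch (assuming $base_url holds the URL, as in the original script):

use LWP::Simple qw(get);
use HTML::LinkExtor;

my $content = get($base_url);          # undef / empty on failure
if (defined $content) {
    my $parser = HTML::LinkExtor->new(undef, $base_url);
    $parser->parse($content);
    $parser->eof;
}
else {
    warn "could not fetch $base_url\n";
}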

Regards,
Gisle



Re: Bug report: Memory leak with LWP::UserAgent/HTTP::Request under RH Linux 6.1/7

2001-04-10 Thread Gisle Aas

"Curt Powell" [EMAIL PROTECTED] writes:

 I have encountered a memory leak under RedHat Linux (versions 6.1 and 7),
 Perl 5.005 and 5.6.  It occurs with multiple calls to LWP::UserAgent and
 HTTP::Request.  Following is a short script that demonstrates the problem.
 On RH7 it shows memory deltas of 8k every 10 or so iterations after the
 first iteration.  The amount leaked doesn't seem to be related to the size
 of the page downloaded.  Is it possible that I am not doing the call
 sequence correctly?

Seems good enough to me.  I also see memory leaking here.
I'll try to investigate.

Regards,
Gisle


 
 Regards,
 
 Curt Powell
 
 #!/usr/bin/perl
 #usage: ./memtest url e.g. ./memtest http://www.sierraridge.com
 
 sub geturl
 {
   use LWP::UserAgent;
   use HTTP::Request;
   my $URL = shift;
   my $UA = LWP::UserAgent->new();
   my $Request = HTTP::Request->new(GET => $URL);
   my $Response = $UA->request($Request);
   print "Error retrieving $URL\n" if ($Response->is_error());
   return $Response->as_string;
 }
 
 sub memused
 {
   local *memused_TMP_FILE;
   open(memused_TMP_FILE, "/proc/$$/stat");
   my $a = <memused_TMP_FILE>;
   close memused_TMP_FILE;
   my @b = split(' ', $a);
   return $b[22];
 }
 
 $url = shift @ARGV;
 $lastused = memused();
 for ($i = 0; $i <= 100; ++$i)
 {
   $length = length(geturl($url));
   $used = memused;
   $delta = $used - $lastused;
   print "$i: response length: $length memory used: $used memory change: $delta\n";
   $lastused = $used;
 }



Re: Bug report: Memory leak with LWP::UserAgent/HTTP::Request under RH Linux 6.1/7

2001-04-10 Thread Gisle Aas

Gisle Aas [EMAIL PROTECTED] writes:

 "Curt Powell" [EMAIL PROTECTED] writes:
 
  I have encountered a memory leak under RedHat Linux (versions 6.1 and 7),
  Perl 5.005 and 5.6.  It occurs with multiple calls to LWP::UserAgent and
  HTTP::Request.  Following is a short script that demonstrates the problem.
  On RH7 it shows memory deltas of 8k every 10 or so iterations after the
  first iteration.  The amount leaked doesn't seem to be related to the size
  of the page downloaded.  Is it possible that I am not doing the call
  sequence correctly?
 
 Seems good enough to me.  I also see memory leaking here.
 I'll try to investigate.

Did you use HTML-Parser 3.20 for your test?

The memory leak went away for me when I downgraded to HTML-Parser 3.19.
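
(To check which HTML-Parser version is actually being picked up, something
like this one-liner should do:

perl -MHTML::Parser -le 'print $HTML::Parser::VERSION'
)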

Regards,
Gisle


  #!/usr/bin/perl
  #usage: ./memtest url e.g. ./memtest http://www.sierraridge.com
  
  sub geturl
  {
  use LWP::UserAgent;
  use HTTP::Request;
  my $URL = shift;
  my $UA = LWP::UserAgent->new();
  my $Request = HTTP::Request->new(GET => $URL);
  my $Response = $UA->request($Request);
  print "Error retrieving $URL\n" if ($Response->is_error());
  return $Response->as_string;
  }
  
  sub memused
  {
  local *memused_TMP_FILE;
  open(memused_TMP_FILE, "/proc/$$/stat");
  my $a = <memused_TMP_FILE>;
  close memused_TMP_FILE;
  my @b = split(' ', $a);
  return $b[22];
  }
  
  $url = shift @ARGV;
  $lastused = memused();
  for ($i = 0; $i <= 100; ++$i)
  {
  $length = length(geturl($url));
  $used = memused;
  $delta = $used - $lastused;
  print "$i: response length: $length memory used: $used memory change: $delta\n";
  $lastused = $used;
  }



RE: Bug report: Memory leak with LWP::UserAgent/HTTP::Request under RH Linux 6.1/7

2001-04-10 Thread Curt Powell

Yes, I am using 3.20.  I will attempt to revert to 3.19 and rerun my test.

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]On Behalf Of Gisle Aas
Sent: Tuesday, April 10, 2001 10:49 AM
To: Curt Powell
Cc: [EMAIL PROTECTED]
Subject: Re: Bug report: Memory leak with LWP::UserAgent/HTTP::Request
under RH Linux 6.1/7


Gisle Aas [EMAIL PROTECTED] writes:

 "Curt Powell" [EMAIL PROTECTED] writes:

  I have encountered a memory leak under RedHat Linux (versions 6.1 and
7),
  Perl 5.005 and 5.6.  It occurs with multiple calls to LWP::UserAgent and
  HTTP::Request.  Following is a short script that demonstrates the
problem.
  On RH7 it shows memory deltas of 8k every 10 or so iterations after the
  first iteration.  The amount leaked doesn't seem to be related to the
size
  of the page downloaded.  Is it possible that I am not doing the call
  sequence correctly?

 Seems good enough to me.  I also see memory leaking here.
 I'll try to investigate.

Did you use HTML-Parser 3.20 for your test?

The memory leak went away for me when I downgraded to HTML-Parser 3.19.

Regards,
Gisle


  #!/usr/bin/perl
  #usage: ./memtest url e.g. ./memtest http://www.sierraridge.com
 
  sub geturl
  {
  use LWP::UserAgent;
  use HTTP::Request;
  my $URL = shift;
  my $UA = LWP::UserAgent->new();
  my $Request = HTTP::Request->new(GET => $URL);
  my $Response = $UA->request($Request);
  print "Error retrieving $URL\n" if ($Response->is_error());
  return $Response->as_string;
  }
 
  sub memused
  {
  local *memused_TMP_FILE;
  open(memused_TMP_FILE, "/proc/$$/stat");
  my $a = <memused_TMP_FILE>;
  close memused_TMP_FILE;
  my @b = split(' ', $a);
  return $b[22];
  }
 
  $url = shift @ARGV;
  $lastused = memused();
  for ($i = 0; $i <= 100; ++$i)
  {
  $length = length(geturl($url));
  $used = memused;
  $delta = $used - $lastused;
  print "$i: response length: $length memory used: $used memory change: $delta\n";
  $lastused = $used;
  }




Re: Bug report: Memory leak with LWP::UserAgent/HTTP::Request under RH Linux 6.1/7

2001-04-10 Thread Gisle Aas

"Curt Powell" [EMAIL PROTECTED] writes:

 Yes, I am using 3.20.  I will attempt to revert to 3.19 and rerun my test.

This patch fixes the leak in 3.20.  Expect to see HTML-Parser-3.21
pretty soon :-(

Regards,
Gisle


Index: hparser.c
===
RCS file: /cvsroot/libwww-perl/html-parser/hparser.c,v
retrieving revision 2.67
retrieving revision 2.68
diff -u -p -u -r2.67 -r2.68
--- hparser.c   2001/04/06 20:03:24 2.67
+++ hparser.c   2001/04/10 18:33:27 2.68
@@ -1,4 +1,4 @@
-/* $Id: hparser.c,v 2.67 2001/04/06 20:03:24 gisle Exp $
+/* $Id: hparser.c,v 2.68 2001/04/10 18:33:27 gisle Exp $
  *
  * Copyright 1999-2001, Gisle Aas
  * Copyright 1999-2000, Michael A. Chase
@@ -243,6 +243,7 @@ report_event(PSTATE* p_state,
SvREFCNT_dec(tagname);
return;
}
+   SvREFCNT_dec(tagname);
}
else if (p_state->ignoring_element) {
return;




Re: still having problems with connecting to wells fargo web site

2001-04-10 Thread Kingpin

 It seems weird that they would do this. why not just a redirect? Anyway

They want to make it hard for people to write spiders which bypass
their pretty websites which they spent millions of dollars
designing!

-- 
 - - Martin "Kingpin" Thurn[EMAIL PROTECTED]
 Research Software Engineer   (703) 793-3700 x2651
 The Information Refinery  http://tir.tasc.com
 TASC, Inc.http://www.tasc.com

Don't give in to hate; that leads to the dark side. -- Ben, The Empire Strikes Back



RE: Bug report: Memory leak with LWP::UserAgent/HTTP::Request under RH Linux 6.1/7

2001-04-10 Thread Curt Powell

I applied the patch and it seems to work.  Thanks!

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]On Behalf Of Gisle Aas
Sent: Tuesday, April 10, 2001 11:36 AM
To: Curt Powell
Cc: [EMAIL PROTECTED]
Subject: Re: Bug report: Memory leak with LWP::UserAgent/HTTP::Request
under RH Linux 6.1/7


"Curt Powell" [EMAIL PROTECTED] writes:

 Yes, I am using 3.20.  I will attempt to revert to 3.19 and rerun my test.

This patch fixes the leak in 3.20.  Expect to see HTML-Parser-3.21
pretty soon :-(

Regards,
Gisle


Index: hparser.c
===
RCS file: /cvsroot/libwww-perl/html-parser/hparser.c,v
retrieving revision 2.67
retrieving revision 2.68
diff -u -p -u -r2.67 -r2.68
--- hparser.c   2001/04/06 20:03:24 2.67
+++ hparser.c   2001/04/10 18:33:27 2.68
@@ -1,4 +1,4 @@
-/* $Id: hparser.c,v 2.67 2001/04/06 20:03:24 gisle Exp $
+/* $Id: hparser.c,v 2.68 2001/04/10 18:33:27 gisle Exp $
  *
  * Copyright 1999-2001, Gisle Aas
  * Copyright 1999-2000, Michael A. Chase
@@ -243,6 +243,7 @@ report_event(PSTATE* p_state,
SvREFCNT_dec(tagname);
return;
}
+   SvREFCNT_dec(tagname);
}
else if (p_state->ignoring_element) {
return;





Re: Crypt::SSLeay 0.23 Client-Cert Patch

2001-04-10 Thread Tobias Manthey

:-)) Not really a silly question. Let me put it this way: you currently
can't use those features with LWP. We just downloaded LWP 5.50, and at
first glance the Crypt::SSLeay support seems to be poor. (Correct me if I am
wrong; perhaps I missed something.) We are currently working on a patch
for LWP 5.50 to bring full support for the implemented features to LWP. I'll
let you know as soon as it is available.

In the meantime you can try LWP-HTTPS-PROXY-CERT-PATCH.tar.gz, which can be
found at http://www.progredy.de/download and which offers client cert support
for Crypt::SSLeay and integrates it into LWP. But that set of patches is based
on a whole set of outdated Perl modules, so I really can't recommend it
(e.g. Crypt::SSLeay 0.17, LWP 5.48).

The best thing is probably to wait until further notice. The time frame is
currently the end of next week.
Regards,
Tobias
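
For what it's worth: later Crypt::SSLeay documentation describes pointing
LWP at a client certificate through environment variables set before the
request is made. Whether the patched 0.23 discussed in this thread honours
these variables is an assumption, so treat this only as a sketch (paths are
hypothetical):

use LWP::UserAgent;
use HTTP::Request;

$ENV{HTTPS_CERT_FILE} = "/path/to/client-cert.pem";   # hypothetical path
$ENV{HTTPS_KEY_FILE}  = "/path/to/client-key.pem";    # hypothetical path

my $ua  = LWP::UserAgent->new;
my $res = $ua->request(HTTP::Request->new(GET => "https://example.com/"));
print $res->status_line, "\n";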


 Maybe this is a silly question or I am just overlooking something.
 
 I applied the Crypt::SSLeay patch and it seems to work fine
 for Net::SSL (at least net_ssl_test works); however, does this allow
 me to use LWP to connect to a site requiring a client certificate? If
 so, how do I tell LWP::UserAgent or HTTP::Request where to find the
 client certificate and key?
 
 Kees Vonk
 

-- 
GMX - Die Kommunikationsplattform im Internet.
http://www.gmx.net




Re: Bug Report

2001-04-10 Thread Bjoern Hoehrmann

* Laurent Simonneau wrote:
Why is the character '|' converted to '%7C' in URLs?

RFC 2396 (http://www.ietf.org/rfc/rfc2396.txt) defines it as an "unwise"
character; such characters must be escaped in URIs.

Example:
libwww-perl sends:

GET http://www.lycos.fr/cgi-bin/nph-bounce?LIA14%7C/service/sms/
HTTP/1.0

And the server replies with a 404 Not Found error.

That's a bug on their side; maybe the Lycos people should get better
software. libwww-perl behaves correctly.
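
For what it's worth, the same escaping can be reproduced directly with
URI::Escape (part of the URI distribution); a small sketch:

use URI::Escape qw(uri_escape);

# escape only the "unwise" vertical bar, leaving the rest of the path alone
print uri_escape("LIA14|/service/sms/", "|"), "\n";
# prints: LIA14%7C/service/sms/
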
-- 
Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de
am Badedeich 7 ° Telefon: +49(0)4667/981028 ° http://bjoern.hoehrmann.de
25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote [!]e
-- listen, learn, contribute -- David J. Marcus