Usage: HTML::Parser::parse(self, chunk)
Hi!

I have written a program in Perl which reads URLs from a file, performs some simple analysis on them, and extracts all the links. Links are written to the file results.net, info about each URL to results.out. I ran the program, and after analysing several thousand pages, I got this error:

===
Uncaught exception from user code:
Usage: HTML::Parser::parse(self, chunk) at jure.pl line ---.
main::analyse('http://alibaba.ijs.si/ME/CD/docs/CES1/dtd2html/cesAna/distributo...') called at jure.pl line 103
===

Here is the program:

#!/usr/bin/perl -w

use CGI;
use HTML::LinkExtor;
use LWP::Simple;
use HTTP::Response;
use LWP;
use LWP::UserAgent;
use URI::URL;
use strict;
use diagnostics;

###
# HTTP Response Codes on Has(c)h (RFC2068)
###
my %statuscode = (
    100 => 'Continue',              101 => 'Switching Protocols',
    200 => 'OK',                    201 => 'Created',
    202 => 'Accepted',              203 => 'Non-Authoritative Information',
    204 => 'No Content',            205 => 'Reset Content',
    206 => 'Partial Content',       300 => 'Multiple Choices',
    301 => 'Moved Permanently',     302 => 'Moved Temporarily',
    303 => 'See Other',             304 => 'Not Modified',
    305 => 'Use Proxy',             400 => 'Bad Request',
    401 => 'Unauthorized',          402 => 'Payment Required',
    403 => 'Forbidden',             404 => 'Not Found',
    405 => 'Method Not Allowed',    406 => 'Not Acceptable',
    407 => 'Proxy Authentication Required',
    408 => 'Request Time-out',      409 => 'Conflict',
    410 => 'Gone',                  411 => 'Length Required',
    412 => 'Precondition Failed',   413 => 'Request Entity Too Large',
    414 => 'Request-URI Too Large', 415 => 'Unsupported Media Type',
    500 => 'Internal Server Error', 501 => 'Not Implemented',
    502 => 'Bad Gateway',           503 => 'Service Unavailable',
    504 => 'Gateway Time-out',      505 => 'HTTP Version not supported'
);

my $filesdir = "files";
$| = 1;
# print "Content-type: text/html\n\n";

BEGIN {
    open(OUT, ">results/result.out") || print "Error!";
    open(NET, ">results/result.net") || print "Error!";
}

END {
    close(NET);
    close(OUT);
}

my $file = "si-url.txt";
if ($file ne '') {
    print "Analysing file: $file";
    open(FILE, "$filesdir/$file") || print "Error - no file $file.";
    local $\ = undef;
    my @file = <FILE>;
    close(FILE);
    print "START:\n";
    my $count = 1;
    foreach my $main_line (@file) {
        $/ = "\r\n";
        chomp($main_line);
        $/ = "\n";
        my $base_url = "$main_line";
        print "Line number: $count\n";
        if ($count > 1665) {
            analyse($base_url);
        }
        $count++;
    }
}

###
# Analyse page!
###
sub analyse {
    my $url = shift;
    chomp($url);
    my $browser = LWP::UserAgent->new();
    $browser->agent("MatejKovacicGregaPetric/InternetResearchProject");
    my $webdoc = $browser->request(HTTP::Request->new(GET => $url));
    my $responsecode = $webdoc->code;
    if ($webdoc->is_success) {
        # COUNT images
        @main::images = ();
        @main::images = $webdoc->content =~ m{ < \s* img \s* ["|']? ( [^\s'"]+ ) ['|"]? }xgm;
        $main::num_img = @main::images;
        my $string = $webdoc->content;
        $string =~ s/\n//g;   # remove all newlines!
        $string =~ s/\r//g;   # remove all carriage returns!
        $string =~ s/\t//g;   # remove all tabs!
        # LENGTH of page
        my $size1 = length($string);    # get the size of the string!
        $string =~ s/<([^>]|\n)*>//g;   # remove all HTML tags
        $string =~ s/  / /g;            # remove all double spaces!
        # LENGTH of text on the page
        my $size2 = length($string);    # get the size of the shortened string!
print OUT "$url $responsecode $size1 $size2 ",$webdoc-base, " ",$webdoc-content_type," ",$webdoc-title, " $main::num_img\n"; # EXTRACT links $main::base_url = $webdoc-base; my $parser = HTML::LinkExtor-new(undef, $main::base_url); $parser-parse(get($main::base_url))-eof; @main::links = $parser-links; my %seen; foreach my $linkarray (@main::links) { @main::element=(); @main::element = @$linkarray; my $elt_type = shift @main::element; while (@main::element) { my ($attr_name , $attr_value) = splice(@main::element, 0, 2); $seen{$attr_value}++; } } @main::arr = sort keys %seen; my $i = 0; for (sort keys %seen) { print NET "$main::base_url - $main::arr[$i]\n"; $i++; } } else { print OUT "$url $responsecode . . . . . .\n"; } } Do you have any idea how to get the solution of this problem? I have to tell you I am quite new in Perl programming... but I am trying hard. Thanks in advance. bye, Matej
Bug report: Memory leak with LWP::UserAgent/HTTP::Request under RH Linux 6.1/7
Dear sir or madam,

I have encountered a memory leak under RedHat Linux (versions 6.1 and 7), Perl 5.005 and 5.6. It occurs with multiple calls to LWP::UserAgent and HTTP::Request. Following is a short script that demonstrates the problem. On RH7 it shows memory deltas of 8k every 10 or so iterations after the first iteration. The amount leaked doesn't seem to be related to the size of the page downloaded.

Is it possible that I am not doing the call sequence correctly?

Regards,
Curt Powell

#!/usr/bin/perl
# usage: ./memtest url    e.g. ./memtest http://www.sierraridge.com

sub geturl {
    use LWP::UserAgent;
    use HTTP::Request;
    my $URL = shift;
    my $UA = LWP::UserAgent->new();
    my $Request = HTTP::Request->new(GET => $URL);
    my $Response = $UA->request($Request);
    print "Error retrieving $URL\n" if ($Response->is_error());
    return $Response->as_string;
}

sub memused {
    local *memused_TMP_FILE;
    open(memused_TMP_FILE, "/proc/$$/stat");
    my $a = <memused_TMP_FILE>;
    close memused_TMP_FILE;
    my @b = split(' ', $a);
    return $b[22];
}

$url = shift @ARGV;
$lastused = memused();
for ($i = 0; $i <= 100; ++$i) {
    $length = length(geturl($url));
    $used = memused();
    $delta = $used - $lastused;
    print "$i: response length: $length  memory used: $used  memory change: $delta\n";
    $lastused = $used;
}
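Constructing the user agent once, outside the loop, narrows the search space for a leak like this: any remaining growth has to come from the per-request cycle or the parser beneath it, not from repeated object construction. A minimal sketch of that variation (same command-line usage as memtest):

    #!/usr/bin/perl
    use strict;
    use LWP::UserAgent;
    use HTTP::Request;

    my $url = shift @ARGV or die "usage: $0 url\n";
    my $ua  = LWP::UserAgent->new;   # created once, reused every iteration

    for my $i (0 .. 100) {
        my $res = $ua->request(HTTP::Request->new(GET => $url));
        printf "%3d: %d bytes\n", $i, length($res->as_string);
    }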
Re: Usage: HTML::Parser::parse(self, chunk)
"Matej Kovacic" [EMAIL PROTECTED] writes: Uncaught exception from user code: Usage: HTML::Parser::parse(self, chunk) at jure.pl line ---. [...] # HTTP Response Codes on Has(c)h (RFC2068) ### my %statuscode = ( 100 = 'Continue', 101 = 'Switching Protocols', 200 = 'OK', 201 = 'Created', BTW, this data is also available from the HTTP::Status module. my $parser = HTML::LinkExtor-new(undef, $main::base_url); $parser-parse(get($main::base_url))-eof; The error probably shows up because get() returns nothing instead of undef. I think the following patch should fix the problem: Index: lib/LWP/Simple.pm === RCS file: /cvsroot/libwww-perl/lwp5/lib/LWP/Simple.pm,v retrieving revision 1.33 diff -u -p -u -r1.33 Simple.pm --- lib/LWP/Simple.pm 2000/05/24 09:40:43 1.33 +++ lib/LWP/Simple.pm 2001/04/10 17:12:27 @@ -298,7 +298,7 @@ sub _trivial_http_get my $sock = IO::Socket::INET-new(PeerAddr = $host, PeerPort = $port, Proto= 'tcp', -Timeout = 60) || return; +Timeout = 60) || return undef; $sock-autoflush; my $netloc = $host; $netloc .= ":$port" if $port != 80; Regards, Gisle
Re: Bug report: Memory leak with LWP::UserAgent/HTTP::Request under RH Linux 6.1/7
"Curt Powell" [EMAIL PROTECTED] writes: I have encountered a memory leak under RedHat Linux (versions 6.1 and 7), Perl 5.005 and 5.6. It occurs with multiple calls to LWP::UserAgent and HTTP::Request. Following is a short script that demonstrates the problem. On RH7 it shows memory deltas of 8k every 10 or so iterations after the first iteration. The amount leaked doesn't seem to be related to the size of the page downloaded. Is it possible that I am not doing the call sequence correctly? Seems good enough to me. I also see memory leaking here. I'll try to investigate. Regards, Gisle Regards, Curt Powell #!/usr/bin/perl #usage: ./memtest url e.g. ./memtest http://www.sierraridge.com sub geturl() { use LWP::UserAgent; use HTTP::Request; my $URL = shift; my $UA = LWP::UserAgent-new(); my $Request = HTTP::Request-new(GET = $URL); my $Response = $UA-request($Request); print "Error retrieving $URL\n" if ($Response-is_error()); return $Response-as_string; } sub memused { local *memused_TMP_FILE; open(memused_TMP_FILE, "/proc/$$/stat"); my $a = memused_TMP_FILE; close memused_TMP_FILE; my @b = split(' ', $a); return $b[22]; } $url = shift ARGV; $lastused = memused(); for ($i=0; $i=100; ++$i) { $length = length(geturl($url)); $used = memused; $delta = $used - $lastused; print "$i: response length: $length memory used: $used memory change: $delta\n"; $lastused = $used; }
Re: Bug report: Memory leak with LWP::UserAgent/HTTP::Request under RH Linux 6.1/7
Gisle Aas [EMAIL PROTECTED] writes:

> "Curt Powell" [EMAIL PROTECTED] writes:
>
> > I have encountered a memory leak under RedHat Linux (versions 6.1 and 7),
> > Perl 5.005 and 5.6. It occurs with multiple calls to LWP::UserAgent and
> > HTTP::Request. [...]
> >
> > Is it possible that I am not doing the call sequence correctly?
>
> Seems good enough to me. I also see memory leaking here. I'll try to
> investigate.

Did you use HTML-Parser 3.20 for your test? The memory leak went away for me when I downgraded to HTML-Parser 3.19.

Regards,
Gisle
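For anyone reproducing this, the installed HTML-Parser version is a one-liner away:

    perl -MHTML::Parser -e 'print HTML::Parser->VERSION, "\n"'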
RE: Bug report: Memory leak with LWP::UserAgent/HTTP::Request under RH Linux 6.1/7
Yes, I am using 3.20. I will attempt to revert to 3.19 and rerun my test.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Gisle Aas
Sent: Tuesday, April 10, 2001 10:49 AM
To: Curt Powell
Cc: [EMAIL PROTECTED]
Subject: Re: Bug report: Memory leak with LWP::UserAgent/HTTP::Request under RH Linux 6.1/7

> Did you use HTML-Parser 3.20 for your test? The memory leak went away
> for me when I downgraded to HTML-Parser 3.19.
>
> Regards,
> Gisle
Re: Bug report: Memory leak with LWP::UserAgent/HTTP::Request under RH Linux 6.1/7
"Curt Powell" [EMAIL PROTECTED] writes: Yes, I am using 3.20. I will attempt to revert to 3.19 and rerun my test. This patch fixes the leak in 3.20. Expect to see HTML-Parser-3.21 pretty soon :-( Regards, Gisle Index: hparser.c === RCS file: /cvsroot/libwww-perl/html-parser/hparser.c,v retrieving revision 2.67 retrieving revision 2.68 diff -u -p -u -r2.67 -r2.68 --- hparser.c 2001/04/06 20:03:24 2.67 +++ hparser.c 2001/04/10 18:33:27 2.68 @@ -1,4 +1,4 @@ -/* $Id: hparser.c,v 2.67 2001/04/06 20:03:24 gisle Exp $ +/* $Id: hparser.c,v 2.68 2001/04/10 18:33:27 gisle Exp $ * * Copyright 1999-2001, Gisle Aas * Copyright 1999-2000, Michael A. Chase @@ -243,6 +243,7 @@ report_event(PSTATE* p_state, SvREFCNT_dec(tagname); return; } + SvREFCNT_dec(tagname); } else if (p_state-ignoring_element) { return;
Re: still having problems with connecting to wells fargo web site
> It seems weird that they would do this. Why not just a redirect?

Anyway, they want to make it hard for people to write spiders which bypass their pretty websites which they spent millions of dollars designing!

--
Martin "Kingpin" Thurn          [EMAIL PROTECTED]
Research Software Engineer      (703) 793-3700 x2651
The Information Refinery        http://tir.tasc.com
TASC, Inc.                      http://www.tasc.com

"Don't give in to hate; that leads to the dark side." -- Ben, The Empire Strikes Back
RE: Bug report: Memory leak with LWP::UserAgent/HTTP::Request under RH Linux 6.1/7
I applied the patch and it seems to work. Thanks!

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Gisle Aas
Sent: Tuesday, April 10, 2001 11:36 AM
To: Curt Powell
Cc: [EMAIL PROTECTED]
Subject: Re: Bug report: Memory leak with LWP::UserAgent/HTTP::Request under RH Linux 6.1/7

> "Curt Powell" [EMAIL PROTECTED] writes:
>
> > Yes, I am using 3.20. I will attempt to revert to 3.19 and rerun my test.
>
> This patch fixes the leak in 3.20. Expect to see HTML-Parser-3.21
> pretty soon :-(
>
> Regards,
> Gisle
Re: Crypt::SSLeay 0.23 Client-Cert Patch
:-)) Not really a silly question. Let me say it this way: you currently can't use these features with LWP. We just downloaded LWP 5.50, and at first glance the Crypt::SSLeay support seems to be poor. (Correct me if I am wrong; perhaps I did not catch the clue.)

We are currently working on a patch for LWP 5.50 to bring full support for the implemented features to LWP. I'll let you know as soon as it is available. So far you can try LWP-HTTPS-PROXY-CERT-PATCH.tar.gz, which can also be found at http://www.progredy.de/download and which offers client cert support for Crypt::SSLeay and integrates it into LWP. But this set of patches is based on a whole set of outdated Perl modules (e.g. Crypt::SSLeay 0.17, LWP 5.48), therefore I really can't recommend it. The best thing seems to be to wait until further notice. The time frame is currently the end of next week.

Regards,
Tobias

> Maybe this is a silly question or I am just overlooking something. I
> implemented the patch of Crypt::SSLeay and it seems to work fine for
> Net::SSL (at least net_ssl_test works). However, does this allow me to
> use LWP to connect to a site requiring a client certificate? If so, how
> do I tell LWP::UserAgent or HTTP::Request where to find the client
> certificate and key?
>
> Kees Vonk
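For reference, client-certificate support in Crypt::SSLeay is configured through environment variables that Net::SSL reads at connect time. A sketch assuming a build with that support (the file paths are placeholders):

    use LWP::UserAgent;
    use HTTP::Request;

    # Point Crypt::SSLeay/Net::SSL at the client certificate and key.
    $ENV{HTTPS_CERT_FILE} = "certs/client-cert.pem";
    $ENV{HTTPS_KEY_FILE}  = "certs/client-key.pem";

    my $ua  = LWP::UserAgent->new;
    my $res = $ua->request(HTTP::Request->new(GET => "https://secure.example.com/"));
    print $res->status_line, "\n";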
Re: Bug Report
* Laurent Simonneau wrote:

> Why is the character '|' converted to '%7C' in URLs?

RFC 2396 (http://www.ietf.org/rfc/rfc2396.txt) defines '|' as an "unwise" character; such characters must be escaped in URIs.

> Example: libwww-perl sends
>
>     GET http://www.lycos.fr/cgi-bin/nph-bounce?LIA14%7C/service/sms/ HTTP/1.0
>
> and the server replies with a 404 Not Found error.

That's a bug on the server's side; maybe the Lycos people should get better software. libwww-perl behaves correctly.

--
Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de
am Badedeich 7 ° Telefon: +49(0)4667/981028 ° http://bjoern.hoehrmann.de
25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote

-- listen, learn, contribute -- David J. Marcus
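The escaping can be reproduced with the URI::Escape module from the URI distribution:

    use URI::Escape qw(uri_escape uri_unescape);

    print uri_escape("LIA14|"), "\n";      # prints LIA14%7C
    print uri_unescape("LIA14%7C"), "\n";  # prints LIA14|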