I completely agree that we should all write standards-compliant HTTP
web pages/CGI programs/servlets etc.
Unfortunately, that is not always in our hands. Since nearly anybody can write
PHP scripts today, clients that attempt to read from them should be
error-tolerant.
If you want to run an HTTP client against an arbitrary number of URLs pointing
to unknown web servers/pages (as in my case: I am writing a web crawler),
you must be able to guarantee a deadlock-free, fault-tolerant way of reading
each page. Have you ever heard of "spider traps"?
Let's look at a simple HttpClient call sequence:
public void test() throws Exception {
    HttpClient client = new HttpClient();
    HttpMethod m = new GetMethod("http://localhost/testfile.php");
    client.executeMethod(m);

    // --- bytes limit as suggested in the discussion
    InputStream body = m.getResponseBodyAsStream();
    int limit = 10; // limit to the first ten bytes
    int i;
    for (i = 0; i < limit; i++) {
        int b = body.read();
        if (b < 0) {
            break; // EOF
        }
    }
    System.err.println("Stopped reading at byte " + i);
    // ---
    m.releaseConnection();
}
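To make the byte limit reusable instead of repeating the loop at every call site, one option is to wrap the response stream in a capping filter. This is a minimal sketch of my own (the class name and limit are hypothetical, not part of HttpClient's API):

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: caps how many bytes may be read from any stream,
// so an endless response body looks like a normal, finite one.
public class BoundedInputStream extends FilterInputStream {
    private long remaining;

    public BoundedInputStream(InputStream in, long maxBytes) {
        super(in);
        this.remaining = maxBytes;
    }

    public int read() throws IOException {
        if (remaining <= 0) {
            return -1; // pretend EOF once the budget is used up
        }
        int b = super.read();
        if (b >= 0) {
            remaining--;
        }
        return b;
    }

    public int read(byte[] buf, int off, int len) throws IOException {
        if (remaining <= 0) {
            return -1;
        }
        int n = super.read(buf, off, (int) Math.min((long) len, remaining));
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }
}
```

With this in place, the test method above could simply wrap `m.getResponseBodyAsStream()` and read to (apparent) EOF without any per-call bookkeeping.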
The following PHP scripts cause HttpClient to hang (1) or to
crash (2):
Test 1: endless.php
<?php
set_time_limit(-1);
while (TRUE) {
    print "The UNIX time is ".time()."<br>\n";
    flush();
    sleep(1);
}
?>
This will cause the program to hang at "m.releaseConnection()", because
HttpClient tries to drain the remaining (never-ending) response body before
releasing the connection.
Test 2: hang-in-headers.php
<?php
// remember to set an adequate memory limit in php.ini
// or use Apache's "asis" feature instead of PHP
$x = str_repeat("X", 1024*1024*32); // send 32 MB of 'X'
Header("HTTP/1.0 300 Multiple Choices");
Header("Location: http://localhost/".$x);
?>
In this case you can even remove the getResponseBody() stuff; the program will
crash with an OutOfMemoryError, because a 32 MB header usually won't fit into
the JVM's default heap.
--
Christian Kohlschütter
[EMAIL PROTECTED]
http://www.newsclub.de - Der Meta-Nachrichten-Dienst