Hi,
I am implementing a multithreaded crawler using HttpClient. The fetching tasks are managed by a ThreadPoolExecutor. But I've hit a weird problem: memory usage keeps increasing as each new task starts to run. Here's the code of the fetcher:

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.methods.GetMethod;

public class TestFetcher implements Runnable {
    String urlObj;
    HttpClient client;

    TestFetcher(HttpClient client, String url) {
        this.client = client;
        urlObj = url;
    }

    public void run() {
        process(urlObj);
    }

    public synchronized void process(String url) {
        HttpMethod method = new GetMethod(url);
        method.setFollowRedirects(true);
        String content = null;
        int fd = 0;
        try {
            client.executeMethod(method);
            Thread.sleep(1000);
            int code = method.getStatusCode();
            if (code == 200) {
                content = method.getResponseBodyAsString();
            } else {
                fd = 10 + code / 100;
            }
        } catch (Exception e) {
            fd = 10;
        } finally {
            method.releaseConnection();
            method = null;
        }
    }
}

And this is how I create new tasks:

while (true) {
    taskPool.execute(new TestFetcher(httpclient, urlPool.getTaskQueue().take()));
    while (some condition) Thread.sleep(delay);
}
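For reference, taskPool and the URL queue are wired up roughly like this. This is a simplified, self-contained sketch of my setup: the pool sizes, queue bounds, and the PoolSetup/dispatchOne names are made up for illustration, and the Runnable just records the URL instead of doing a real fetch.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolSetup {
    // Takes one URL from a bounded queue, hands it to the pool,
    // and waits for the task to finish (stand-in for the real fetch).
    public static String dispatchOne() throws InterruptedException {
        // Bounded work queue so pending tasks cannot pile up without limit
        BlockingQueue<Runnable> workQueue = new ArrayBlockingQueue<Runnable>(100);

        // 5 core threads, 10 max, idle threads reaped after 60s (illustrative sizes)
        ThreadPoolExecutor taskPool =
                new ThreadPoolExecutor(5, 10, 60L, TimeUnit.SECONDS, workQueue);

        // The URL queue that the while(true) loop take()s from
        BlockingQueue<String> taskQueue = new ArrayBlockingQueue<String>(1000);
        taskQueue.put("http://example.com/");

        final String url = taskQueue.take();
        final StringBuilder result = new StringBuilder();
        final CountDownLatch done = new CountDownLatch(1);
        taskPool.execute(new Runnable() {
            public void run() {
                // In the real crawler this is new TestFetcher(httpclient, url).run()
                result.append("fetched " + url);
                done.countDown();
            }
        });
        done.await(5, TimeUnit.SECONDS);
        taskPool.shutdown();
        return result.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(dispatchOne());
    }
}
```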

I used to use HttpURLConnection to do the fetching, and there was no memory problem at all. The reason I want to use HttpClient is that it can take IP addresses instead of domain names.

Please help.
Thanks,

Yang

