On Tue, 2006-03-14 at 01:52 -0500, James Ostheimer wrote: 
> Hi-
> 
> I am using HttpClient in a multi-threaded web crawler application.  I am using 
> the MultiThreadedHttpConnectionManager in conjunction with 300 threads that 
> download pages from various sites.
> 
> Problem is that I am running out of memory shortly after the process begins.  
> I used JProfiler to analyze the memory stacks and it points to:
>   a.. 76.2% - 233,587 kB - 6,626 alloc. 
> org.apache.commons.httpclient.HttpMethod.getResponseBodyAsString 
> as the culprit (at most there should be a little over 300 allocations as 
> there are 300 threads operating at once).  Other relevant information, I am 
> on a Windows XP Pro platform using the SUN JRE that came with jdk1.5.0_06.  I 
> am using commons-httpclient-3.0.jar.
> 

James,

There's no memory leak in HttpClient. Just do not use
HttpMethod#getResponseBodyAsString(), which is not intended for
retrieving response entities of arbitrary length, because it buffers
the entire response content in memory in order to convert it to a
String. If your crawler hits a site that generates an endless stream of
garbage, the JVM is bound to run out of memory.

Use getResponseBodyAsStream() instead.
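
A minimal sketch of what reading from the stream with an upper bound
could look like. It uses only java.io so it can be tried on its own;
the class name BoundedBodyReader, the method readAtMost and the cap
values are hypothetical, not part of HttpClient. In the crawler, the
InputStream would be the one returned by
method.getResponseBodyAsStream() (which may be null when there is no
response body), and releaseConnection() would still be called in the
finally block.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of HttpClient: reads a response body
// from a stream but never keeps more than a fixed number of bytes.
final class BoundedBodyReader {

    // Reads up to maxBytes from the stream and returns what was read.
    // Bytes beyond the cap are left unread, so an endless response
    // cannot exhaust the heap.
    static byte[] readAtMost(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int remaining = maxBytes;
        while (remaining > 0) {
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n == -1) {
                break; // end of stream reached before hitting the cap
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a large response body; in the crawler this would be
        // the stream obtained from method.getResponseBodyAsStream().
        InputStream body = new ByteArrayInputStream(new byte[10000]);
        byte[] kept = readAtMost(body, 4500);
        System.out.println("bytes kept: " + kept.length);
    }
}
```

With 300 threads, the worst-case memory held in response bodies is then
roughly 300 times the cap, rather than being dictated by whatever the
remote sites send.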

Hope this helps

Oleg

> Here is the code where I initialize the HttpClient:
> 
> private HttpClient httpClient; 
>  
>  public CrawlerControllerThread(QueueThread qt, MessageReceiver receiver,
>    int maxThreads, String flag, boolean filter, String filterString,
>    String dbType) {
>   this.qt = qt;
>   this.receiver = receiver;
>   this.maxThreads = maxThreads;
>   this.flag = flag;
>   this.filter = filter;
>   this.filterString = filterString;
>   this.dbType = dbType;
>   threads = new ArrayList();
>   lastStatus = new HashMap();
>   
>   HttpConnectionManagerParams htcmp = new HttpConnectionManagerParams();
>   htcmp.setMaxTotalConnections(maxThreads);
>   htcmp.setDefaultMaxConnectionsPerHost(10);
>   htcmp.setSoTimeout(5000);
>   MultiThreadedHttpConnectionManager mtcm = new MultiThreadedHttpConnectionManager();
>   mtcm.setParams(htcmp);
>   httpClient = new HttpClient(mtcm);
>   
>   
>  }
> 
> The client reference to httpClient is then passed to all the crawling threads 
> where it is used as follows:
> 
> private String getPageApache(URL pageURL, ArrayList unProcessed) {
>   SaveURL saveURL = new SaveURL();
>   HttpMethod method = null;
>   HttpURLConnection urlConnection = null;
>   String rawPage = "";
>   try {
>    method = new GetMethod(pageURL.toExternalForm());
>    method.setFollowRedirects(true);
>    method.setRequestHeader("Content-type", "text/html");
>    int statusCode = httpClient.executeMethod(method);
> //   urlConnection = new HttpURLConnection(method,
> //     pageURL);
>    logger.debug("Requesting: "+pageURL.toExternalForm());
> 
>    
>    rawPage = method.getResponseBodyAsString();
>    //rawPage = saveURL.getURL(urlConnection);
>    if(rawPage == null){
>     unProcessed.add(pageURL);
>    } 
>    return rawPage;
>   } catch (IllegalArgumentException e) {
>    //e.printStackTrace();
>    
>   } 
>   catch (HttpException e) {
>    
>    //e.printStackTrace();
>   } catch (IOException e) {
>    unProcessed.add(pageURL);
>    //e.printStackTrace();
>   }finally {
>    if(method != null) {
>     method.releaseConnection();
>    }
>    try {
>     if(urlConnection != null) {
>      if(urlConnection.getInputStream() != null) {
>       urlConnection.getInputStream().close();
>      }
>     }
>    } catch (IOException e) {
>     // TODO Auto-generated catch block
>     e.printStackTrace();
>    }
>    urlConnection = null;
>    method = null;
>   }
>   return null;
>  }
> 
> As you can see, I release the connection in the finally statement, so that 
> should not be a problem. Upon running the getPageApache above the returned 
> page as a string is processed and then set to null for garbage collection. I 
> have been playing with this, closing streams, using HttpURLConnection instead 
> of the GetMethod, and I cannot find the answer.  Indeed it seems the answer 
> does not lie in my code.  
> 
> I would greatly appreciate any help that anyone can give me; I am at the end 
> of my rope with this one.
> 
> James

