Oleg-

I noticed that the javadocs recommend using getResponseBodyAsStream(), so I changed my get code as follows:

private String getPageApache(URL pageURL, ArrayList unProcessed, ArrayList tooBig) {
    SaveURL saveURL = new SaveURL();
    HttpMethod method = null;
    HttpURLConnection urlConnection = null;
    BufferedReader br = null;
    InputStream is = null;
    InputStreamReader ir = null;
    String rawPage = "";
    try {
        method = new GetMethod(pageURL.toExternalForm());
        method.setFollowRedirects(true);
        method.setRequestHeader("Content-type", "text/html");
        int statusCode = httpClient.executeMethod(method);
        // urlConnection = new HttpURLConnection(method, pageURL);
        logger.debug("Requesting: " + pageURL.toExternalForm());

        is = method.getResponseBodyAsStream();
        if (is == null) {
            return null;
        }

        ir = new InputStreamReader(is);

        char[] cbuf = new char[1024];
        int len = 0;
        while ((len = ir.read(cbuf)) > 0) {
            // Convert only the chars actually read; new String(cbuf) would also
            // append stale data left over from earlier iterations.
            rawPage += new String(cbuf, 0, len);
            if (rawPage.getBytes().length > 100000) {
                logger.info("Page " + pageURL.toExternalForm() + " is greater than 100 kB, skipping.");
                tooBig.add(pageURL);
                return null;
            }
        }
        // Earlier BufferedReader version, kept for reference:
        // br = new BufferedReader(new InputStreamReader(is));
        // String tmp;
        // while ((tmp = br.readLine()) != null) {
        //     rawPage += tmp;
        //     if (rawPage.getBytes().length > 100000) {
        //         logger.info("Page " + pageURL.toExternalForm() + " is greater than 100 kB, skipping.");
        //         tooBig.add(pageURL);
        //         return null;
        //     }
        // }
        // System.out.println("Page size: " + rawPage.getBytes().length
        //     + " Free memory: " + Runtime.getRuntime().freeMemory()
        //     + " Used memory: " + (Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()));
        // rawPage = method.getResponseBodyAsString();
        // rawPage = saveURL.getURL(urlConnection);
        if (rawPage == null) {
            unProcessed.add(pageURL);
        }
        return rawPage;
    } catch (IllegalArgumentException e) {
        // ignore
    } catch (HttpException e) {
        // ignore
    } catch (IOException e) {
        unProcessed.add(pageURL);
    } finally {
        try {
            if (ir != null) {
                ir.close();
            }
        } catch (IOException e3) {
            e3.printStackTrace();
        }
        try {
            if (is != null) {
                is.close();
            }
        } catch (IOException e2) {
            e2.printStackTrace();
        }
        try {
            if (br != null) {
                br.close();
            }
        } catch (IOException e1) {
            e1.printStackTrace();
        }
        if (method != null) {
            method.releaseConnection();
        }
        try {
            if (urlConnection != null && urlConnection.getInputStream() != null) {
                urlConnection.getInputStream().close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        urlConnection = null;
        method = null;
    }
    return null;
}


So I now quit on any file over 100 kB. My memory still steadily rises, and JProfiler now claims:

   72.7% - 102,068 kB - 2,463 alloc. java.lang.StringBuffer.toString()

from getPageApache(). That means that somewhere in the above code (including calls into HttpClient) I am eating up a lot of memory in StringBuffers.

I don't have any StringBuffers in my own code, at least not ostensibly, but I suspected one was being used behind the scenes by BufferedReader, so I switched to a plain InputStreamReader (the original BufferedReader loop is the commented-out code above). The memory still rose steadily until the JVM failed, and JProfiler again reported a runaway StringBuffer. As far as I can see, there are no explicit StringBuffers anywhere in my code.
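
The only place I can think of is the += concatenation in the read loop itself: as I understand it, the compiler turns String += into buffer appends behind the scenes, roughly like this (a StringBuffer with a 1.4 compiler target, a StringBuilder with 1.5):

    // Roughly what "rawPage += new String(cbuf, 0, len);" compiles to:
    rawPage = new StringBuffer()
            .append(rawPage)                    // copy of the whole page so far
            .append(new String(cbuf, 0, len))   // plus the new chunk
            .toString();                        // fresh String, every iteration

If that's right, every pass through the loop copies the entire accumulated page into a fresh buffer and then into a fresh String, but I would have expected all of that to be garbage collected.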

I am asking for more help, as this does not seem to be just a problem with using getResponseBodyAsString().

James

----- Original Message ----- From: "Oleg Kalnichevski" <[EMAIL PROTECTED]>
To: "HttpClient User Discussion" <[email protected]>
Sent: Tuesday, March 14, 2006 5:57 AM
Subject: Re: Memory leak using httpclient


On Tue, 2006-03-14 at 01:52 -0500, James Ostheimer wrote:
Hi-

I am using HttpClient in a multi-threaded webcrawler application. I am using the MultiThreadedHttpConnectionManager in conjunction with 300 threads that download pages from various sites.

Problem is that I am running out of memory shortly after the process begins. I used JProfiler to analyze the memory stacks, and it points to:

   76.2% - 233,587 kB - 6,626 alloc. org.apache.commons.httpclient.HttpMethod.getResponseBodyAsString

as the culprit (at most there should be a little over 300 allocations, since there are 300 threads operating at once). Other relevant information: I am on a Windows XP Pro platform using the Sun JRE that came with jdk1.5.0_06, and I am using commons-httpclient-3.0.jar.


James,

There's no memory leak in HttpClient. Just do not use the
HttpMethod#getResponseBodyAsString() method, which is not intended for
retrieval of response entities of arbitrary length, because it buffers
the entire response content in memory in order to convert it to a
String. If your crawler hits a site that generates an endless stream of
garbage, the JVM is bound to run out of memory.

Use getResponseBodyAsStream() instead.
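
For instance, something along these lines (just a sketch; readCapped and maxChars are illustrative names, not part of the HttpClient API):

    // Read at most maxChars characters from the response body.
    private static String readCapped(HttpMethod method, int maxChars) throws IOException {
        InputStream is = method.getResponseBodyAsStream();
        if (is == null) {
            return null;
        }
        Reader reader = new InputStreamReader(is);
        StringBuffer page = new StringBuffer();
        char[] buf = new char[4096];
        int len;
        while ((len = reader.read(buf)) > 0) {
            page.append(buf, 0, len);        // accumulate without intermediate Strings
            if (page.length() > maxChars) {
                return null;                 // response too large, skip it
            }
        }
        return page.toString();
    }

Remember to call releaseConnection() on the method in a finally block either way.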

Hope this helps

Oleg

Here is the code where I initialize the HttpClient:

private HttpClient httpClient;

public CrawlerControllerThread(QueueThread qt, MessageReceiver receiver, int maxThreads,
        String flag, boolean filter, String filterString, String dbType) {
    this.qt = qt;
    this.receiver = receiver;
    this.maxThreads = maxThreads;
    this.flag = flag;
    this.filter = filter;
    this.filterString = filterString;
    this.dbType = dbType;
    threads = new ArrayList();
    lastStatus = new HashMap();

    HttpConnectionManagerParams htcmp = new HttpConnectionManagerParams();
    htcmp.setMaxTotalConnections(maxThreads);
    htcmp.setDefaultMaxConnectionsPerHost(10);
    htcmp.setSoTimeout(5000);
    MultiThreadedHttpConnectionManager mtcm = new MultiThreadedHttpConnectionManager();
    mtcm.setParams(htcmp);
    httpClient = new HttpClient(mtcm);
}

The httpClient reference is then passed to all the crawling threads, where it is used as follows:

private String getPageApache(URL pageURL, ArrayList unProcessed) {
    SaveURL saveURL = new SaveURL();
    HttpMethod method = null;
    HttpURLConnection urlConnection = null;
    String rawPage = "";
    try {
        method = new GetMethod(pageURL.toExternalForm());
        method.setFollowRedirects(true);
        method.setRequestHeader("Content-type", "text/html");
        int statusCode = httpClient.executeMethod(method);
        // urlConnection = new HttpURLConnection(method, pageURL);
        logger.debug("Requesting: " + pageURL.toExternalForm());

        rawPage = method.getResponseBodyAsString();
        // rawPage = saveURL.getURL(urlConnection);
        if (rawPage == null) {
            unProcessed.add(pageURL);
        }
        return rawPage;
    } catch (IllegalArgumentException e) {
        // ignore
    } catch (HttpException e) {
        // ignore
    } catch (IOException e) {
        unProcessed.add(pageURL);
    } finally {
        if (method != null) {
            method.releaseConnection();
        }
        try {
            if (urlConnection != null && urlConnection.getInputStream() != null) {
                urlConnection.getInputStream().close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        urlConnection = null;
        method = null;
    }
    return null;
}

As you can see, I release the connection in the finally block, so that should not be a problem. After getPageApache returns, the page string is processed and then set to null for garbage collection. I have been playing with this, closing streams, using HttpURLConnection instead of the GetMethod, and I cannot find the answer. Indeed, it seems the answer does not lie in my code.

I greatly appreciate any help that anyone can give me; I am at the end of my rope with this one.

James

