Oleg-

I noticed that the javadocs recommend using getResponseBodyAsStream(), so I changed my get code as follows:

private String getPageApache(URL pageURL, ArrayList unProcessed, ArrayList tooBig) {
    SaveURL saveURL = new SaveURL();
    HttpMethod method = null;
    HttpURLConnection urlConnection = null;
    BufferedReader br = null;
    InputStream is = null;
    InputStreamReader ir = null;
    String rawPage = "";
    try {
        method = new GetMethod(pageURL.toExternalForm());
        method.setFollowRedirects(true);
        method.setRequestHeader("Content-type", "text/html");
        int statusCode = httpClient.executeMethod(method);
        // urlConnection = new HttpURLConnection(method, pageURL);
        logger.debug("Requesting: " + pageURL.toExternalForm());

        is = method.getResponseBodyAsStream();
        if (is == null) {
            return null;
        }

        ir = new InputStreamReader(is);

        char[] cbuf = new char[1024];
        int len = 0;
        while ((len = ir.read(cbuf)) > 0) {
            // Convert only the chars actually read; new String(cbuf) would also
            // append stale data left over from earlier iterations.
            rawPage += new String(cbuf, 0, len);
            if (rawPage.getBytes().length > 100000) {
                logger.info("Page " + pageURL.toExternalForm() + " is greater than 100 kB, skipping.");
                tooBig.add(pageURL);
                return null;
            }
        }
        // Earlier BufferedReader version, kept for reference:
        // br = new BufferedReader(new InputStreamReader(is));
        // String tmp;
        // while ((tmp = br.readLine()) != null) {
        //     rawPage += tmp;
        //     if (rawPage.getBytes().length > 100000) {
        //         logger.info("Page " + pageURL.toExternalForm() + " is greater than 100 kB, skipping.");
        //         tooBig.add(pageURL);
        //         return null;
        //     }
        // }
        // System.out.println("Page size: " + rawPage.getBytes().length
        //     + " Free memory: " + Runtime.getRuntime().freeMemory()
        //     + " Used memory: " + (Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()));
        // rawPage = method.getResponseBodyAsString();
        // rawPage = saveURL.getURL(urlConnection);
        if (rawPage == null) {
            unProcessed.add(pageURL);
        }
        return rawPage;
    } catch (IllegalArgumentException e) {
        // ignore
    } catch (HttpException e) {
        // ignore
    } catch (IOException e) {
        unProcessed.add(pageURL);
    } finally {
        try {
            if (ir != null) {
                ir.close();
            }
        } catch (IOException e3) {
            e3.printStackTrace();
        }
        try {
            if (is != null) {
                is.close();
            }
        } catch (IOException e2) {
            e2.printStackTrace();
        }
        try {
            if (br != null) {
                br.close();
            }
        } catch (IOException e1) {
            e1.printStackTrace();
        }
        if (method != null) {
            method.releaseConnection();
        }
        try {
            if (urlConnection != null && urlConnection.getInputStream() != null) {
                urlConnection.getInputStream().close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        urlConnection = null;
        method = null;
    }
    return null;
}


So I now quit on any file over 100 kB. My memory still steadily rises, and JProfiler now claims:

   72.7% - 102,068 kB - 2,463 alloc. java.lang.StringBuffer.toString()

from getPageApache(). That means that somewhere in the above code (including calls into HttpClient) I am eating up a lot of memory in StringBuffers.

I don't have any StringBuffers in my own code, at least not ostensibly, but I suspected one was being used behind the scenes by BufferedReader, so I switched to a plain InputStreamReader (the original BufferedReader loop is the commented-out code above). The memory still rose steadily until the JVM failed, and JProfiler again reported a runaway StringBuffer. As far as I can see, there are no explicit StringBuffers anywhere in my code.
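
The only place I can think of is the += concatenation in the read loop itself: as I understand it, the compiler turns String += into buffer appends behind the scenes, roughly like this (a StringBuffer with a 1.4 compiler target, a StringBuilder with 1.5):

    // Roughly what "rawPage += new String(cbuf, 0, len);" compiles to:
    rawPage = new StringBuffer()
            .append(rawPage)                    // copy of the whole page so far
            .append(new String(cbuf, 0, len))   // plus the new chunk
            .toString();                        // fresh String, every iteration

If that's right, every pass through the loop copies the entire accumulated page into a fresh buffer and then into a fresh String, but I would have expected all of that to be garbage collected.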

I am asking for more help, as this does not seem to be just a problem with using getResponseBodyAsString().

James

----- Original Message ----- From: "Oleg Kalnichevski" <[EMAIL PROTECTED]>
To: "HttpClient User Discussion" <[email protected]>
Sent: Tuesday, March 14, 2006 5:57 AM
Subject: Re: Memory leak using httpclient


On Tue, 2006-03-14 at 01:52 -0500, James Ostheimer wrote:
Hi-

I am using HttpClient in a multi-threaded webcrawler application. I am using the MultiThreadedHttpConnectionManager in conjunction with 300 threads that download pages from various sites.

Problem is that I am running out of memory shortly after the process begins. I used JProfiler to analyze the memory stacks, and it points to:

   76.2% - 233,587 kB - 6,626 alloc. org.apache.commons.httpclient.HttpMethod.getResponseBodyAsString

as the culprit (at most there should be a little over 300 allocations, since there are 300 threads operating at once). Other relevant information: I am on a Windows XP Pro platform using the Sun JRE that came with jdk1.5.0_06, and I am using commons-httpclient-3.0.jar.


James,

There's no memory leak in HttpClient. Just do not use the
HttpMethod#getResponseBodyAsString() method, which is not intended for
retrieval of response entities of arbitrary length, because it buffers
the entire response content in memory in order to convert it to a
String. If your crawler hits a site that generates an endless stream of
garbage, the JVM is bound to run out of memory.

Use getResponseBodyAsStream() instead.
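
For instance, something along these lines (just a sketch; readCapped and maxChars are illustrative names, not part of the HttpClient API):

    // Read at most maxChars characters from the response body.
    private static String readCapped(HttpMethod method, int maxChars) throws IOException {
        InputStream is = method.getResponseBodyAsStream();
        if (is == null) {
            return null;
        }
        Reader reader = new InputStreamReader(is);
        StringBuffer page = new StringBuffer();
        char[] buf = new char[4096];
        int len;
        while ((len = reader.read(buf)) > 0) {
            page.append(buf, 0, len);        // accumulate without intermediate Strings
            if (page.length() > maxChars) {
                return null;                 // response too large, skip it
            }
        }
        return page.toString();
    }

Remember to call releaseConnection() on the method in a finally block either way.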

Hope this helps

Oleg

Here is the code where I initialize the HttpClient:

private HttpClient httpClient;

public CrawlerControllerThread(QueueThread qt, MessageReceiver receiver, int maxThreads,
        String flag, boolean filter, String filterString, String dbType) {
    this.qt = qt;
    this.receiver = receiver;
    this.maxThreads = maxThreads;
    this.flag = flag;
    this.filter = filter;
    this.filterString = filterString;
    this.dbType = dbType;
    threads = new ArrayList();
    lastStatus = new HashMap();

    HttpConnectionManagerParams htcmp = new HttpConnectionManagerParams();
    htcmp.setMaxTotalConnections(maxThreads);
    htcmp.setDefaultMaxConnectionsPerHost(10);
    htcmp.setSoTimeout(5000);
    MultiThreadedHttpConnectionManager mtcm = new MultiThreadedHttpConnectionManager();
    mtcm.setParams(htcmp);
    httpClient = new HttpClient(mtcm);
}

The httpClient reference is then passed to all the crawling threads, where it is used as follows:

private String getPageApache(URL pageURL, ArrayList unProcessed) {
    SaveURL saveURL = new SaveURL();
    HttpMethod method = null;
    HttpURLConnection urlConnection = null;
    String rawPage = "";
    try {
        method = new GetMethod(pageURL.toExternalForm());
        method.setFollowRedirects(true);
        method.setRequestHeader("Content-type", "text/html");
        int statusCode = httpClient.executeMethod(method);
        // urlConnection = new HttpURLConnection(method, pageURL);
        logger.debug("Requesting: " + pageURL.toExternalForm());

        rawPage = method.getResponseBodyAsString();
        // rawPage = saveURL.getURL(urlConnection);
        if (rawPage == null) {
            unProcessed.add(pageURL);
        }
        return rawPage;
    } catch (IllegalArgumentException e) {
        // ignore
    } catch (HttpException e) {
        // ignore
    } catch (IOException e) {
        unProcessed.add(pageURL);
    } finally {
        if (method != null) {
            method.releaseConnection();
        }
        try {
            if (urlConnection != null && urlConnection.getInputStream() != null) {
                urlConnection.getInputStream().close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        urlConnection = null;
        method = null;
    }
    return null;
}

As you can see, I release the connection in the finally block, so that should not be a problem. After getPageApache returns, the page string is processed and then set to null for garbage collection. I have been playing with this, closing streams, using HttpURLConnection instead of the GetMethod, and I cannot find the answer. Indeed, it seems the answer does not lie in my code.

I greatly appreciate any help that anyone can give me; I am at the end of my rope with this one.

James

