This is most certainly a problem with resources: string concatenation
inside your read loop.

   while((len = ir.read(cbuf)) > 0) {
    rawPage += new String(cbuf);

That += compiles to a hidden java.lang.StringBuffer append()/toString()
pair on every iteration.
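It is easier to see in the desugared form. Roughly, this is what the
compiler emits for each pass of the loop (illustrative only, not code to
copy; newer compilers targeting 5.0 use StringBuilder instead):

   // what  rawPage += new String(cbuf);  turns into:
   rawPage = new StringBuffer()
       .append(rawPage)           // copies the whole page so far
       .append(new String(cbuf))  // plus the new 1024-char chunk
       .toString();               // allocates yet another full copy

So every chunk re-copies the entire accumulated page: quadratic
allocation, and exactly the StringBuffer.toString() hot spot JProfiler
is pointing at.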
logger.info("Page "+pageURL.toExternalForm()+"is greater than 100kb, 
skipping.");




-----Original Message-----
From: James Ostheimer [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 14, 2006 12:54 PM
To: HttpClient User Discussion
Subject: Re: Memory leak using httpclient

Oleg-

I noticed that the javadocs recommend using getResponseBodyAsStream, and I
changed my get code as follows:

 private String getPageApache(URL pageURL, ArrayList unProcessed,
   ArrayList tooBig) {
  SaveURL saveURL = new SaveURL();
  HttpMethod method = null;
  HttpURLConnection urlConnection = null;
  BufferedReader br  = null;
  InputStream is = null;
  InputStreamReader ir = null;
  String rawPage = "";
  try {
   method = new GetMethod(getExternalForm(pageURL));
   method.setFollowRedirects(true);
   method.setRequestHeader("Content-type", "text/html");
   int statusCode = httpClient.executeMethod(method);
//   urlConnection = new HttpURLConnection(method,
//     pageURL);
   logger.debug("Requesting: "+pageURL.toExternalForm());

   is = method.getResponseBodyAsStream();
   if(is == null) {
    return null;
   }

   ir = new InputStreamReader(is);

   char[] cbuf = new char[1024];
   int len = 0;
   while((len = ir.read(cbuf)) > 0) {
    rawPage += new String(cbuf);
    if(rawPage.getBytes().length > (100000)) {
     logger.info("Page "+pageURL.toExternalForm()+"is greater than
100kb, 
skipping.");
     tooBig.add(pageURL);
     return null;
    }
   }
//   br= new BufferedReader(new InputStreamReader(is));
//   String tmp = new String();
//   while((tmp = br.readLine()) != null) {
//    rawPage += tmp;
//    if(rawPage.getBytes().length > (100000)) {
//     logger.info("Page "+pageURL.toExternalForm()+" is greater than 100kb, skipping.");
//     tooBig.add(pageURL);
//     return null;
//    }
//   }
   //System.out.println("Page size: " + rawPage.getBytes().length
   //  + " Memory used: " + Runtime.getRuntime().freeMemory()
   //  + " Memory usage (amount left): "
   //  + (Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()));
   //rawPage = method.getResponseBodyAsString();
   //rawPage = saveURL.getURL(urlConnection);
   if(rawPage == null){
    unProcessed.add(pageURL);
   }
   return rawPage;
  } catch (IllegalArgumentException e) {
   //e.printStackTrace();

  }
  catch (HttpException e) {

   //e.printStackTrace();
  } catch (IOException e) {
   unProcessed.add(pageURL);
   //e.printStackTrace();
  }finally {
   try {
    if(ir != null) {
     ir.close();
    }
   } catch (IOException e3) {
    e3.printStackTrace();
   }
   try {
    if(is != null)
     is.close();
   } catch (IOException e2) {
    // TODO Auto-generated catch block
    e2.printStackTrace();
   }
   try {
    if(br != null)
     br.close();
   } catch (IOException e1) {
    // TODO Auto-generated catch block
    e1.printStackTrace();
   }
   if(method != null) {
    method.releaseConnection();
   }
   try {
    if(urlConnection != null) {
     if(urlConnection.getInputStream() != null) {
      urlConnection.getInputStream().close();
     }
    }
   } catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
   }
   urlConnection = null;
   method = null;
  }
  return null;
 }


So I now quit on any file over 100 kb.  My memory still steadily rises, and
now JProfiler claims:

    * 72.7% - 102,068 kB - 2,463 alloc. java.lang.StringBuffer.toString()
from getPageApache().  That means that somewhere in the above code 
(including calls into HttpClient) I am eating up a lot of memory in 
StringBuffers.

I don't ostensibly have any StringBuffers in my code, but I suspected that
one was used behind the scenes in BufferedReader, so in the commented code
above you can see where I tried using a straight InputStreamReader.  The
memory still steadily rose until it failed, and JProfiler reported the same
as above: a StringBuffer was running away.  There is little chance that the
StringBuffer is in my code, as far as I can see.

I am asking for more help, as this does not seem to be just a problem with
using getResponseBodyAsString.

James

----- Original Message ----- 
From: "Oleg Kalnichevski" <[EMAIL PROTECTED]>
To: "HttpClient User Discussion" <[email protected]>
Sent: Tuesday, March 14, 2006 5:57 AM
Subject: Re: Memory leak using httpclient


> On Tue, 2006-03-14 at 01:52 -0500, James Ostheimer wrote:
>> Hi-
>>
>> I am using HttpClient in a multi-threaded webcrawler application.  I am
>> using the MultiThreadedHttpConnectionManager in conjunction with 300
>> threads that download pages from various sites.
>>
>> Problem is that I am running out of memory shortly after the process
>> begins.  I used JProfiler to analyze the memory stacks, and it points to:
>>   * 76.2% - 233,587 kB - 6,626 alloc.
>>     org.apache.commons.httpclient.HttpMethod.getResponseBodyAsString
>> as the culprit (at most there should be a little over 300 allocations,
>> as there are 300 threads operating at once).  Other relevant information:
>> I am on a Windows XP Pro platform using the Sun JRE that came with
>> jdk1.5.0_06, and I am using commons-httpclient-3.0.jar.
>>
>
> James,
>
> There's no memory leak in HttpClient. Just do not use the
> HttpMethod#getResponseBodyAsString() method, which is not intended for
> retrieval of response entities of arbitrary length, because it buffers
> the entire response content in memory in order to convert it to a
> String. If your crawler hits a site that generates an endless stream of
> garbage, the JVM is bound to run out of memory.
>
> Use getResponseBodyAsStream() instead.
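>
> For example, along these lines (just a sketch; the 100 kB cap and
> variable names are illustrative, and getResponseContentLength()
> returns -1 when the server sends no Content-Length header):
>
>    GetMethod method = new GetMethod(url);
>    try {
>        httpClient.executeMethod(method);
>        // when the server declares a Content-Length up front,
>        // skip oversized bodies without reading them at all
>        if (method.getResponseContentLength() > 100000) {
>            return null;
>        }
>        Reader reader = new InputStreamReader(
>                method.getResponseBodyAsStream());
>        StringBuilder page = new StringBuilder();
>        char[] buf = new char[1024];
>        int n;
>        while ((n = reader.read(buf)) > 0) {
>            page.append(buf, 0, n);
>            if (page.length() > 100000) {
>                return null; // cap still enforced while streaming
>            }
>        }
>        return page.toString();
>    } finally {
>        method.releaseConnection(); // always return the connection
>    }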
>
> Hope this helps
>
> Oleg
>
>> Here is the code where I initialize the HttpClient:
>>
>> private HttpClient httpClient;
>>
>>  public CrawlerControllerThread(QueueThread qt, MessageReceiver receiver,
>>    int maxThreads, String flag, boolean filter, String filterString,
>>    String dbType) {
>>   this.qt = qt;
>>   this.receiver = receiver;
>>   this.maxThreads = maxThreads;
>>   this.flag = flag;
>>   this.filter = filter;
>>   this.filterString = filterString;
>>   this.dbType = dbType;
>>   threads = new ArrayList();
>>   lastStatus = new HashMap();
>>
>>   HttpConnectionManagerParams htcmp = new HttpConnectionManagerParams();
>>   htcmp.setMaxTotalConnections(maxThreads);
>>   htcmp.setDefaultMaxConnectionsPerHost(10);
>>   htcmp.setSoTimeout(5000);
>>   MultiThreadedHttpConnectionManager mtcm = new 
>> MultiThreadedHttpConnectionManager();
>>   mtcm.setParams(htcmp);
>>   httpClient = new HttpClient(mtcm);
>>
>>
>>  }
>>
>> The client reference to httpClient is then passed to all the crawling
>> threads, where it is used as follows:
>>
>> private String getPageApache(URL pageURL, ArrayList unProcessed) {
>>   SaveURL saveURL = new SaveURL();
>>   HttpMethod method = null;
>>   HttpURLConnection urlConnection = null;
>>   String rawPage = "";
>>   try {
>>    method = new GetMethod(pageURL.toExternalForm());
>>    method.setFollowRedirects(true);
>>    method.setRequestHeader("Content-type", "text/html");
>>    int statusCode = httpClient.executeMethod(method);
>> //   urlConnection = new HttpURLConnection(method,
>> //     pageURL);
>>    logger.debug("Requesting: "+pageURL.toExternalForm());
>>
>>
>>    rawPage = method.getResponseBodyAsString();
>>    //rawPage = saveURL.getURL(urlConnection);
>>    if(rawPage == null){
>>     unProcessed.add(pageURL);
>>    }
>>    return rawPage;
>>   } catch (IllegalArgumentException e) {
>>    //e.printStackTrace();
>>
>>   }
>>   catch (HttpException e) {
>>
>>    //e.printStackTrace();
>>   } catch (IOException e) {
>>    unProcessed.add(pageURL);
>>    //e.printStackTrace();
>>   }finally {
>>    if(method != null) {
>>     method.releaseConnection();
>>    }
>>    try {
>>     if(urlConnection != null) {
>>      if(urlConnection.getInputStream() != null) {
>>       urlConnection.getInputStream().close();
>>      }
>>     }
>>    } catch (IOException e) {
>>     // TODO Auto-generated catch block
>>     e.printStackTrace();
>>    }
>>    urlConnection = null;
>>    method = null;
>>   }
>>   return null;
>>  }
>>
>> As you can see, I release the connection in the finally block, so that
>> should not be a problem.  Once getPageApache returns, the page string is
>> processed and then set to null for garbage collection.  I have been
>> playing with this, closing streams, using HttpURLConnection instead of
>> GetMethod, and I cannot find the answer.  Indeed, it seems the answer
>> does not lie in my code.
>>
>> I greatly appreciate any help that anyone can give me; I am at the end
>> of my rope with this one.
>>
>> James
>
>