On Tue, 2006-03-14 at 01:52 -0500, James Ostheimer wrote:
> Hi-
>
> I am using httpclient in a multi-threaded webcrawler application. I am using
> the MultiThreadedHttpConnectionManager in conjunction with 300 threads that
> download pages from various sites.
>
> Problem is that I am running out of memory shortly after the process begins.
> I used JProfiler to analyze the memory stacks and it points to:
> 76.2% - 233,587 kB - 6,626 alloc. -
> org.apache.commons.httpclient.HttpMethod.getResponseBodyAsString
> as the culprit (at most there should be a little over 300 allocations as
> there are 300 threads operating at once). Other relevant information, I am
> on a Windows XP Pro platform using the SUN JRE that came with jdk1.5.0_06. I
> am using commons-httpclient-3.0.jar.
>
James,
There's no memory leak in HttpClient. Just do not use the
HttpMethod#getResponseBodyAsString() method: it is not intended for
retrieving response entities of arbitrary length, because it buffers the
entire response content in memory in order to convert it to a String. If
your crawler hits a site that generates an endless stream of garbage, the
JVM is bound to run out of memory.
Use getResponseBodyAsStream() instead.
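A minimal sketch of what that looks like: read the stream in chunks and bail
out once a configurable cap is exceeded. The helper name, the 1 MB cap, and
the charset are my own assumptions, not part of the HttpClient API; in your
crawler the InputStream would come from method.getResponseBodyAsStream()
after executeMethod(), with releaseConnection() still in the finally block.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class BoundedBodyReader {

    // Read at most maxBytes from the stream and decode with the given
    // charset; return null when the body exceeds the cap so the caller
    // can skip the page instead of exhausting the heap.
    static String readCapped(InputStream in, int maxBytes, String charset)
            throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            if (buf.size() + n > maxBytes) {
                return null; // page too large for this crawler, skip it
            }
            buf.write(chunk, 0, n);
        }
        return buf.toString(charset);
    }

    public static void main(String[] args) throws IOException {
        // In the crawler this stream would be:
        //   InputStream in = method.getResponseBodyAsStream();
        InputStream in =
            new ByteArrayInputStream("hello world".getBytes("UTF-8"));
        System.out.println(readCapped(in, 1024 * 1024, "UTF-8"));
        // prints "hello world"
    }
}
```

With 300 threads, a 1 MB cap bounds the worst case at roughly 300 MB of
body buffers instead of "whatever the server decides to send".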
Hope this helps
Oleg
> Here is the code where I initialize the HttpClient:
>
> private HttpClient httpClient;
>
> public CrawlerControllerThread(QueueThread qt, MessageReceiver receiver,
>         int maxThreads, String flag, boolean filter, String filterString,
>         String dbType) {
>     this.qt = qt;
>     this.receiver = receiver;
>     this.maxThreads = maxThreads;
>     this.flag = flag;
>     this.filter = filter;
>     this.filterString = filterString;
>     this.dbType = dbType;
>     threads = new ArrayList();
>     lastStatus = new HashMap();
>
>     HttpConnectionManagerParams htcmp = new HttpConnectionManagerParams();
>     htcmp.setMaxTotalConnections(maxThreads);
>     htcmp.setDefaultMaxConnectionsPerHost(10);
>     htcmp.setSoTimeout(5000);
>     MultiThreadedHttpConnectionManager mtcm =
>         new MultiThreadedHttpConnectionManager();
>     mtcm.setParams(htcmp);
>     httpClient = new HttpClient(mtcm);
> }
>
> The client reference to httpClient is then passed to all the crawling threads
> where it is used as follows:
>
> private String getPageApache(URL pageURL, ArrayList unProcessed) {
>     SaveURL saveURL = new SaveURL();
>     HttpMethod method = null;
>     HttpURLConnection urlConnection = null;
>     String rawPage = "";
>     try {
>         method = new GetMethod(pageURL.toExternalForm());
>         method.setFollowRedirects(true);
>         method.setRequestHeader("Content-type", "text/html");
>         int statusCode = httpClient.executeMethod(method);
>         // urlConnection = new HttpURLConnection(method, pageURL);
>         logger.debug("Requesting: " + pageURL.toExternalForm());
>
>         rawPage = method.getResponseBodyAsString();
>         //rawPage = saveURL.getURL(urlConnection);
>         if (rawPage == null) {
>             unProcessed.add(pageURL);
>         }
>         return rawPage;
>     } catch (IllegalArgumentException e) {
>         //e.printStackTrace();
>     } catch (HttpException e) {
>         //e.printStackTrace();
>     } catch (IOException e) {
>         unProcessed.add(pageURL);
>         //e.printStackTrace();
>     } finally {
>         if (method != null) {
>             method.releaseConnection();
>         }
>         try {
>             if (urlConnection != null) {
>                 if (urlConnection.getInputStream() != null) {
>                     urlConnection.getInputStream().close();
>                 }
>             }
>         } catch (IOException e) {
>             // TODO Auto-generated catch block
>             e.printStackTrace();
>         }
>         urlConnection = null;
>         method = null;
>     }
>     return null;
> }
>
> As you can see, I release the connection in the finally block, so that
> should not be the problem. After getPageApache returns, the page string is
> processed and then set to null for garbage collection. I have been playing
> with this, closing streams and using HttpURLConnection instead of
> GetMethod, and I cannot find the answer. Indeed it seems the answer does
> not lie in my code.
>
> I greatly appreciate any help that anyone can give me; I am at the end of
> my rope with this one.
>
> James