This is almost certainly where the memory is going:

    while ((len = ir.read(cbuf)) > 0) {
        rawPage += new String(cbuf);

String concatenation with += compiles down to a java.lang.StringBuffer append plus a StringBuffer.toString() call, so this loop rebuilds the whole page on every read. The same thing happens in this concatenation:

    logger.info("Page " + pageURL.toExternalForm() + "is greater than 100kb, skipping.");
-----Original Message-----
From: James Ostheimer [mailto:[EMAIL PROTECTED]
Sent: Tuesday, March 14, 2006 12:54 PM
To: HttpClient User Discussion
Subject: Re: Memory leak using httpclient
Oleg-
I noticed that the javadocs recommend using getResponseBodyAsStream, and I changed my get code as follows:
private String getPageApache(URL pageURL, ArrayList unProcessed, ArrayList tooBig) {
    SaveURL saveURL = new SaveURL();
    HttpMethod method = null;
    HttpURLConnection urlConnection = null;
    BufferedReader br = null;
    InputStream is = null;
    InputStreamReader ir = null;
    String rawPage = new String();
    try {
        method = new GetMethod(getExternalForm(pageURL));
        method.setFollowRedirects(true);
        method.setRequestHeader("Content-type", "text/html");
        int statusCode = httpClient.executeMethod(method);
        // urlConnection = new HttpURLConnection(method, pageURL);
        logger.debug("Requesting: " + pageURL.toExternalForm());
        is = (InputStream) method.getResponseBodyAsStream();
        if (is == null) {
            return null;
        }
        ir = new InputStreamReader(is);
        char[] cbuf = new char[1024];
        int len = 0;
        while ((len = ir.read(cbuf)) > 0) {
            rawPage += new String(cbuf);
            if (rawPage.getBytes().length > (100000)) {
                logger.info("Page " + pageURL.toExternalForm() + "is greater than 100kb, skipping.");
                tooBig.add(pageURL);
                return null;
            }
        }
        // br = new BufferedReader(new InputStreamReader(is));
        // String tmp = new String();
        // while ((tmp = br.readLine()) != null) {
        //     rawPage += tmp;
        //     if (rawPage.getBytes().length > (100000)) {
        //         logger.info("Page " + pageURL.toExternalForm() + "is greater than 100kb, skipping.");
        //         tooBig.add(pageURL);
        //         return null;
        //     }
        // }
        // System.out.println("Page size: " + rawPage.getBytes().length + " Memory used: "
        //         + Runtime.getRuntime().freeMemory() + " Memory usage (amount left): "
        //         + (Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()));
        // rawPage = method.getResponseBodyAsString();
        // rawPage = saveURL.getURL(urlConnection);
        if (rawPage == null) {
            unProcessed.add(pageURL);
        }
        return rawPage;
    } catch (IllegalArgumentException e) {
        // e.printStackTrace();
    } catch (HttpException e) {
        // e.printStackTrace();
    } catch (IOException e) {
        unProcessed.add(pageURL);
        // e.printStackTrace();
    } finally {
        try {
            if (ir != null) {
                ir.close();
            }
        } catch (IOException e3) {
            e3.printStackTrace();
        }
        try {
            if (is != null)
                is.close();
        } catch (IOException e2) {
            // TODO Auto-generated catch block
            e2.printStackTrace();
        }
        try {
            if (br != null)
                br.close();
        } catch (IOException e1) {
            // TODO Auto-generated catch block
            e1.printStackTrace();
        }
        if (method != null) {
            method.releaseConnection();
        }
        try {
            if (urlConnection != null) {
                if (urlConnection.getInputStream() != null) {
                    urlConnection.getInputStream().close();
                }
            }
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        urlConnection = null;
        method = null;
    }
    return null;
}
So I now quit on any file over 100 kB. My memory still steadily rises, and JProfiler now claims:

    72.7% - 102,068 kB - 2,463 alloc. java.lang.StringBuffer.toString()

from getPageApache(). That means that somewhere in the above code (including calls into HttpClient) I am eating up a lot of memory in StringBuffers.

I don't have any explicit StringBuffers in my code, but I suspected that one was used behind the scenes in BufferedReader, so in the commented code above you can see where I tried using a plain InputStreamReader instead. The memory still steadily rose until it failed, and JProfiler again reported, as above, that a StringBuffer was running away. As far as I can see, there is little chance that the StringBuffer is in my own code.

I am asking for more help, as this does not seem to be just a problem with using getResponseBodyAsString.
James
----- Original Message -----
From: "Oleg Kalnichevski" <[EMAIL PROTECTED]>
To: "HttpClient User Discussion" <[email protected]>
Sent: Tuesday, March 14, 2006 5:57 AM
Subject: Re: Memory leak using httpclient
> On Tue, 2006-03-14 at 01:52 -0500, James Ostheimer wrote:
>> Hi-
>>
>> I am using HttpClient in a multi-threaded webcrawler application. I am
>> using the MultiThreadedHttpConnectionManager in conjunction with 300
>> threads that download pages from various sites.
>>
>> The problem is that I am running out of memory shortly after the
>> process begins. I used JProfiler to analyze the memory stacks and it
>> points to:
>>
>>     76.2% - 233,587 kB - 6,626 alloc.
>>     org.apache.commons.httpclient.HttpMethod.getResponseBodyAsString
>>
>> as the culprit (at most there should be a little over 300 allocations,
>> as there are 300 threads operating at once). Other relevant
>> information: I am on a Windows XP Pro platform using the Sun JRE that
>> came with jdk1.5.0_06. I am using commons-httpclient-3.0.jar.
>>
>
> James,
>
> There's no memory leak in HttpClient. Just do not use the
> HttpMethod#getResponseBodyAsString() method, which is not intended for
> retrieval of response entities of arbitrary length, because it buffers
> the entire response content in memory in order to convert it to a
> String. If your crawler hits a site that generates an endless stream of
> garbage, the JVM is bound to run out of memory.
>
> Use getResponseBodyAsStream() instead.
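>
> For illustration only (this is a sketch, not HttpClient code; the 100 kB
> cap is the number from your own check and the buffer size is arbitrary),
> a capped read over the stream could look like this:
>
>     GetMethod method = new GetMethod(url);
>     try {
>         httpClient.executeMethod(method);
>         InputStream is = method.getResponseBodyAsStream();
>         if (is == null) {
>             return null;
>         }
>         ByteArrayOutputStream body = new ByteArrayOutputStream();
>         byte[] buf = new byte[4096];
>         int len;
>         while ((len = is.read(buf)) != -1) {
>             body.write(buf, 0, len);
>             if (body.size() > 100000) {
>                 return null;        // page too big, skip it
>             }
>         }
>         return new String(body.toByteArray(), method.getResponseCharSet());
>     } finally {
>         // always hand the connection back to the connection manager
>         method.releaseConnection();
>     }
>
> That way only as much of the response as you are willing to keep ever
> sits in memory.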
>
> Hope this helps
>
> Oleg
>
>> Here is the code where I initialize the HttpClient:
>>
>> private HttpClient httpClient;
>>
>> public CrawlerControllerThread(QueueThread qt, MessageReceiver receiver,
>>         int maxThreads, String flag, boolean filter, String filterString,
>>         String dbType) {
>>     this.qt = qt;
>>     this.receiver = receiver;
>>     this.maxThreads = maxThreads;
>>     this.flag = flag;
>>     this.filter = filter;
>>     this.filterString = filterString;
>>     this.dbType = dbType;
>>     threads = new ArrayList();
>>     lastStatus = new HashMap();
>>
>>     HttpConnectionManagerParams htcmp = new HttpConnectionManagerParams();
>>     htcmp.setMaxTotalConnections(maxThreads);
>>     htcmp.setDefaultMaxConnectionsPerHost(10);
>>     htcmp.setSoTimeout(5000);
>>     MultiThreadedHttpConnectionManager mtcm = new MultiThreadedHttpConnectionManager();
>>     mtcm.setParams(htcmp);
>>     httpClient = new HttpClient(mtcm);
>> }
>>
>> The client reference to httpClient is then passed to all the crawling
>> threads where it is used as follows:
>>
>> private String getPageApache(URL pageURL, ArrayList unProcessed) {
>>     SaveURL saveURL = new SaveURL();
>>     HttpMethod method = null;
>>     HttpURLConnection urlConnection = null;
>>     String rawPage = "";
>>     try {
>>         method = new GetMethod(pageURL.toExternalForm());
>>         method.setFollowRedirects(true);
>>         method.setRequestHeader("Content-type", "text/html");
>>         int statusCode = httpClient.executeMethod(method);
>>         // urlConnection = new HttpURLConnection(method, pageURL);
>>         logger.debug("Requesting: " + pageURL.toExternalForm());
>>
>>         rawPage = method.getResponseBodyAsString();
>>         // rawPage = saveURL.getURL(urlConnection);
>>         if (rawPage == null) {
>>             unProcessed.add(pageURL);
>>         }
>>         return rawPage;
>>     } catch (IllegalArgumentException e) {
>>         // e.printStackTrace();
>>     } catch (HttpException e) {
>>         // e.printStackTrace();
>>     } catch (IOException e) {
>>         unProcessed.add(pageURL);
>>         // e.printStackTrace();
>>     } finally {
>>         if (method != null) {
>>             method.releaseConnection();
>>         }
>>         try {
>>             if (urlConnection != null) {
>>                 if (urlConnection.getInputStream() != null) {
>>                     urlConnection.getInputStream().close();
>>                 }
>>             }
>>         } catch (IOException e) {
>>             // TODO Auto-generated catch block
>>             e.printStackTrace();
>>         }
>>         urlConnection = null;
>>         method = null;
>>     }
>>     return null;
>> }
>>
>> As you can see, I release the connection in the finally block, so that
>> should not be the problem. After getPageApache runs, the returned page
>> string is processed and then set to null for garbage collection. I have
>> been playing with this, closing streams and using HttpURLConnection
>> instead of GetMethod, and I cannot find the answer. Indeed, it seems the
>> answer does not lie in my code.
>>
>> I greatly appreciate any help that anyone can give me; I am at the end
>> of my rope with this one.
>>
>> James
>
>