Oleg-
I noticed that the javadocs recommend using getResponseBodyAsStream, so I
changed my GET code as follows:
private String getPageApache(URL pageURL, ArrayList unProcessed, ArrayList tooBig) {
    SaveURL saveURL = new SaveURL();
    HttpMethod method = null;
    HttpURLConnection urlConnection = null;
    BufferedReader br = null;
    InputStream is = null;
    InputStreamReader ir = null;
    String rawPage = new String();
    try {
        method = new GetMethod(getExternalForm(pageURL));
        method.setFollowRedirects(true);
        method.setRequestHeader("Content-type", "text/html");
        int statusCode = httpClient.executeMethod(method);
        // urlConnection = new HttpURLConnection(method, pageURL);
        logger.debug("Requesting: " + pageURL.toExternalForm());
        is = (InputStream) method.getResponseBodyAsStream();
        if (is == null) {
            return null;
        }
        ir = new InputStreamReader(is);
        char[] cbuf = new char[1024];
        int len = 0;
        while ((len = ir.read(cbuf)) > 0) {
            rawPage += new String(cbuf);
            if (rawPage.getBytes().length > 100000) {
                logger.info("Page " + pageURL.toExternalForm() + " is greater than 100kb, skipping.");
                tooBig.add(pageURL);
                return null;
            }
        }
        // br = new BufferedReader(new InputStreamReader(is));
        // String tmp = new String();
        // while ((tmp = br.readLine()) != null) {
        //     rawPage += tmp;
        //     if (rawPage.getBytes().length > 100000) {
        //         logger.info("Page " + pageURL.toExternalForm() + " is greater than 100kb, skipping.");
        //         tooBig.add(pageURL);
        //         return null;
        //     }
        // }
        // System.out.println("Page size: " + rawPage.getBytes().length
        //     + " Memory used: " + Runtime.getRuntime().freeMemory()
        //     + " Memory usage (amount left): "
        //     + (Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()));
        // rawPage = method.getResponseBodyAsString();
        // rawPage = saveURL.getURL(urlConnection);
        if (rawPage == null) {
            unProcessed.add(pageURL);
        }
        return rawPage;
    } catch (IllegalArgumentException e) {
        // e.printStackTrace();
    } catch (HttpException e) {
        // e.printStackTrace();
    } catch (IOException e) {
        unProcessed.add(pageURL);
        // e.printStackTrace();
    } finally {
        try {
            if (ir != null) {
                ir.close();
            }
        } catch (IOException e3) {
            e3.printStackTrace();
        }
        try {
            if (is != null) {
                is.close();
            }
        } catch (IOException e2) {
            e2.printStackTrace();
        }
        try {
            if (br != null) {
                br.close();
            }
        } catch (IOException e1) {
            e1.printStackTrace();
        }
        if (method != null) {
            method.releaseConnection();
        }
        try {
            if (urlConnection != null) {
                if (urlConnection.getInputStream() != null) {
                    urlConnection.getInputStream().close();
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        urlConnection = null;
        method = null;
    }
    return null;
}
So I now quit on any file over 100 kB. My memory still steadily rises, and
JProfiler now claims:

  72.7% - 102,068 kB - 2,463 alloc. java.lang.StringBuffer.toString()

from getPageApache(). That means that somewhere in the above code
(including the calls into HttpClient) I am eating up a lot of memory in
StringBuffers.
I don't have any StringBuffers in my code, at least not explicitly, but I
suspected one was being used behind the scenes by BufferedReader, so in the
commented-out code above you can see where I tried a plain InputStreamReader
instead. The memory still rose steadily until the process failed, and
JProfiler again reported a runaway StringBuffer. As far as I can see, there
is little chance that the StringBuffer is allocated directly in my code.
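In case a concrete comparison helps, here is a stripped-down sketch of the kind
of bounded read loop I could switch to, accumulating into a single StringBuilder
instead of concatenating Strings (the MAX_PAGE_CHARS constant and the readCapped
name are just illustrative, not from my real code):

    // Sketch only: cap the page size without building a new String on every pass.
    // MAX_PAGE_CHARS is an illustrative constant, roughly the 100 kB limit above.
    private static final int MAX_PAGE_CHARS = 100000;

    private String readCapped(InputStream is) throws IOException {
        InputStreamReader ir = new InputStreamReader(is);
        StringBuilder page = new StringBuilder();
        char[] cbuf = new char[1024];
        int len;
        while ((len = ir.read(cbuf)) > 0) {
            // Append only the characters actually read on this pass.
            page.append(cbuf, 0, len);
            if (page.length() > MAX_PAGE_CHARS) {
                // Page too large for the crawler; give up on it.
                return null;
            }
        }
        return page.toString();
    }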
So I am asking for more help, as this no longer seems to be just a problem
with using getResponseBodyAsString.
James
----- Original Message -----
From: "Oleg Kalnichevski" <[EMAIL PROTECTED]>
To: "HttpClient User Discussion" <[email protected]>
Sent: Tuesday, March 14, 2006 5:57 AM
Subject: Re: Memory leak using httpclient
On Tue, 2006-03-14 at 01:52 -0500, James Ostheimer wrote:
Hi-
I am using HttpClient in a multi-threaded webcrawler application. I am
using the MultiThreadedHttpConnectionManager in conjunction with 300
threads that download pages from various sites.
The problem is that I am running out of memory shortly after the process
begins. I used JProfiler to analyze the memory usage and it points to:

  76.2% - 233,587 kB - 6,626 alloc.
  org.apache.commons.httpclient.HttpMethod.getResponseBodyAsString

as the culprit (at most there should be a little over 300 allocations,
since there are 300 threads operating at once). Other relevant information:
I am on Windows XP Pro using the Sun JRE that came with jdk1.5.0_06, and I
am using commons-httpclient-3.0.jar.
James,
There's no memory leak in HttpClient. Just do not use the
HttpMethod#getResponseBodyAsString() method, which is not intended for
retrieval of response entities of arbitrary length, because it buffers the
entire response content in memory in order to convert it to a String. If
your crawler hits a site that generates an endless stream of garbage, the
JVM is bound to run out of memory.
Use getResponseBodyAsStream() instead.
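Roughly along these lines (just a sketch; the buffer size, the MAX_BYTES
limit and the surrounding variables are placeholders you would adapt to
your crawler):

    GetMethod get = new GetMethod(url);
    try {
        client.executeMethod(get);
        InputStream in = get.getResponseBodyAsStream();
        if (in != null) {
            byte[] buf = new byte[4096];
            long total = 0;
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
                if (total > MAX_BYTES) {
                    // Bail out once the response exceeds your sanity limit.
                    break;
                }
                // Process buf[0..n) incrementally here instead of
                // accumulating the whole page in memory.
            }
        }
    } finally {
        get.releaseConnection();
    }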
Hope this helps
Oleg
Here is the code where I initialize the HttpClient:
private HttpClient httpClient;

public CrawlerControllerThread(QueueThread qt, MessageReceiver receiver,
        int maxThreads, String flag,
        boolean filter, String filterString, String dbType) {
    this.qt = qt;
    this.receiver = receiver;
    this.maxThreads = maxThreads;
    this.flag = flag;
    this.filter = filter;
    this.filterString = filterString;
    this.dbType = dbType;
    threads = new ArrayList();
    lastStatus = new HashMap();

    HttpConnectionManagerParams htcmp = new HttpConnectionManagerParams();
    htcmp.setMaxTotalConnections(maxThreads);
    htcmp.setDefaultMaxConnectionsPerHost(10);
    htcmp.setSoTimeout(5000);
    MultiThreadedHttpConnectionManager mtcm = new MultiThreadedHttpConnectionManager();
    mtcm.setParams(htcmp);
    httpClient = new HttpClient(mtcm);
}
The httpClient reference is then passed to all the crawling threads, where
it is used as follows:
private String getPageApache(URL pageURL, ArrayList unProcessed) {
    SaveURL saveURL = new SaveURL();
    HttpMethod method = null;
    HttpURLConnection urlConnection = null;
    String rawPage = "";
    try {
        method = new GetMethod(pageURL.toExternalForm());
        method.setFollowRedirects(true);
        method.setRequestHeader("Content-type", "text/html");
        int statusCode = httpClient.executeMethod(method);
        // urlConnection = new HttpURLConnection(method, pageURL);
        logger.debug("Requesting: " + pageURL.toExternalForm());
        rawPage = method.getResponseBodyAsString();
        // rawPage = saveURL.getURL(urlConnection);
        if (rawPage == null) {
            unProcessed.add(pageURL);
        }
        return rawPage;
    } catch (IllegalArgumentException e) {
        // e.printStackTrace();
    } catch (HttpException e) {
        // e.printStackTrace();
    } catch (IOException e) {
        unProcessed.add(pageURL);
        // e.printStackTrace();
    } finally {
        if (method != null) {
            method.releaseConnection();
        }
        try {
            if (urlConnection != null) {
                if (urlConnection.getInputStream() != null) {
                    urlConnection.getInputStream().close();
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        urlConnection = null;
        method = null;
    }
    return null;
}
As you can see, I release the connection in the finally block, so that
should not be the problem. After getPageApache returns, the page string is
processed and then set to null so it can be garbage collected. I have been
playing with this (closing streams, using HttpURLConnection instead of
GetMethod) and I cannot find the answer. Indeed, it seems the answer does
not lie in my code.

I greatly appreciate any help that anyone can give; I am at the end of my
rope with this one.
James