Hi, James,
If you are downloading a huge webpage, you probably can't keep it in
memory, as a String or a byte[] or anything else.
Consider saving the webpage to disk, and then dealing with it as a local
file from then on. Here's how I would save a webpage to disk, reading and
writing it in 4KB chunks:
InputStream in = null;
FileOutputStream out = null;
try
{
    in = method.getResponseBodyAsStream();
    out = new FileOutputStream( "/file/to/write" );
    byte[] buf = new byte[ 4096 ];
    int bytesRead = 0;
    while ( bytesRead != -1 )
    {
        // Read up to 4KB from the stream.
        bytesRead = in.read( buf );
        if ( bytesRead > 0 )
        {
            // Write up to 4KB to disk.
            out.write( buf, 0, bytesRead );
        }
    }
}
finally
{
    if ( in != null )
    {
        try
        {
            in.close();
        }
        catch ( IOException ioe )
        {
            ioe.printStackTrace();
        }
    }
    if ( out != null )
    {
        try
        {
            out.close();
        }
        catch ( IOException ioe )
        {
            ioe.printStackTrace();
        }
    }
    method.releaseConnection();
}
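
Once the response is safely on disk you can process it with ordinary file
I/O, a line or a chunk at a time, so the whole page never has to sit in
memory. A rough sketch of what I mean (it only needs java.io and
java.util.regex; the printLinks name and the href pattern are just
placeholders for whatever processing your crawler really does):

static void printLinks( String path ) throws IOException
{
    // Hypothetical helper: scans the saved page for href attributes,
    // holding only one line in memory at a time.
    BufferedReader reader = new BufferedReader( new FileReader( path ) );
    try
    {
        Pattern href = Pattern.compile( "href=\"([^\"]+)\"" );
        String line;
        while ( ( line = reader.readLine() ) != null )
        {
            Matcher m = href.matcher( line );
            while ( m.find() )
            {
                System.out.println( m.group( 1 ) );  // or feed it to your crawl queue
            }
        }
    }
    finally
    {
        reader.close();
    }
}

Anything else you keep per page (title, outlinks, and so on) can be pulled
out the same way, without the full page ever existing as one String.
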
On Tue, 2006-03-14 at 11:57 +0100, Oleg Kalnichevski wrote:
> On Tue, 2006-03-14 at 01:52 -0500, James Ostheimer wrote:
> > Hi-
> >
> > I am using httpclient in a multi-threaded webcrawler application. I am
> > using the MulitThreadedHttpConnectionManager in conjunction with 300
> > threads that download pages from various sites.
> >
> > Problem is that I am running out of memory shortly after the process
> > begins. I used JProfiler to analyze the memory stacks and it points to:
> >   * 76.2% - 233,587 kB - 6,626 alloc.
> >     org.apache.commons.httpclient.HttpMethod.getResponseBodyAsString
> > as the culprit (at most there should be a little over 300 allocations as
> > there are 300 threads operating at once). Other relevant information, I am
> > on a Windows XP Pro platform using the SUN JRE that came with jdk1.5.0_06.
> > I am using commons-httpclient-3.0.jar.
> >
>
> James,
>
> There's no memory leak in HttpClient. Just do not use the
> HttpMethod#getResponseBodyAsString() method, which is not intended for
> retrieving response entities of arbitrary length, because it buffers
> the entire response content in memory in order to convert it to a
> String. If your crawler hits a site that generates an endless stream of
> garbage, the JVM is bound to run out of memory.
>
> Use getResponseBodyAsStream() instead.
>
> Hope this helps
>
> Oleg
>
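
For what it's worth, here's roughly what that looks like if you'd rather
keep the page in memory than write it to disk, as long as you put a hard
limit on how much you read. The 1 MB cap and the hard-coded UTF-8 charset
are just numbers I picked for the sketch, and it assumes the fragment
replaces the getResponseBodyAsString() call inside your existing try/catch
for IOException:

// Rough sketch: read the body through the stream, but stop once a hard
// cap is reached so one runaway site can't exhaust the heap.
final int MAX_PAGE_BYTES = 1024 * 1024;  // arbitrary 1 MB cap for this example
InputStream in = method.getResponseBodyAsStream();
String rawPage = null;
if ( in != null )
{
    ByteArrayOutputStream body = new ByteArrayOutputStream();
    byte[] buf = new byte[ 4096 ];
    int bytesRead;
    boolean tooBig = false;
    while ( !tooBig && ( bytesRead = in.read( buf ) ) != -1 )
    {
        body.write( buf, 0, bytesRead );
        tooBig = body.size() > MAX_PAGE_BYTES;
    }
    if ( tooBig )
    {
        // Drop the connection instead of draining an endless body.
        method.abort();
    }
    else
    {
        in.close();
        // Hard-coding UTF-8 is a simplification; the real charset
        // comes from the Content-Type header.
        rawPage = body.toString( "UTF-8" );
    }
}
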
> > Here is the code where I initialize the HttpClient:
> >
> > private HttpClient httpClient;
> >
> > public CrawlerControllerThread(QueueThread qt, MessageReceiver receiver,
> >         int maxThreads, String flag,
> >         boolean filter, String filterString, String dbType) {
> >     this.qt = qt;
> >     this.receiver = receiver;
> >     this.maxThreads = maxThreads;
> >     this.flag = flag;
> >     this.filter = filter;
> >     this.filterString = filterString;
> >     this.dbType = dbType;
> >     threads = new ArrayList();
> >     lastStatus = new HashMap();
> >
> >     HttpConnectionManagerParams htcmp = new HttpConnectionManagerParams();
> >     htcmp.setMaxTotalConnections(maxThreads);
> >     htcmp.setDefaultMaxConnectionsPerHost(10);
> >     htcmp.setSoTimeout(5000);
> >     MultiThreadedHttpConnectionManager mtcm =
> >         new MultiThreadedHttpConnectionManager();
> >     mtcm.setParams(htcmp);
> >     httpClient = new HttpClient(mtcm);
> > }
> >
> > The client reference to httpClient is then passed to all the crawling
> > threads where it is used as follows:
> >
> > private String getPageApache(URL pageURL, ArrayList unProcessed) {
> >     SaveURL saveURL = new SaveURL();
> >     HttpMethod method = null;
> >     HttpURLConnection urlConnection = null;
> >     String rawPage = "";
> >     try {
> >         method = new GetMethod(pageURL.toExternalForm());
> >         method.setFollowRedirects(true);
> >         method.setRequestHeader("Content-type", "text/html");
> >         int statusCode = httpClient.executeMethod(method);
> >         // urlConnection = new HttpURLConnection(method, pageURL);
> >         logger.debug("Requesting: " + pageURL.toExternalForm());
> >
> >         rawPage = method.getResponseBodyAsString();
> >         // rawPage = saveURL.getURL(urlConnection);
> >         if (rawPage == null) {
> >             unProcessed.add(pageURL);
> >         }
> >         return rawPage;
> >     } catch (IllegalArgumentException e) {
> >         // e.printStackTrace();
> >     } catch (HttpException e) {
> >         // e.printStackTrace();
> >     } catch (IOException e) {
> >         unProcessed.add(pageURL);
> >         // e.printStackTrace();
> >     } finally {
> >         if (method != null) {
> >             method.releaseConnection();
> >         }
> >         try {
> >             if (urlConnection != null) {
> >                 if (urlConnection.getInputStream() != null) {
> >                     urlConnection.getInputStream().close();
> >                 }
> >             }
> >         } catch (IOException e) {
> >             // TODO Auto-generated catch block
> >             e.printStackTrace();
> >         }
> >         urlConnection = null;
> >         method = null;
> >     }
> >     return null;
> > }
> >
> > As you can see, I release the connection in the finally block, so that
> > should not be a problem. After getPageApache returns, the page String is
> > processed and then set to null for garbage collection. I have been playing
> > with this, closing streams, using HttpURLConnection instead of the
> > GetMethod, and I cannot find the answer. Indeed, it seems the answer does
> > not lie in my code.
> >
> > I greatly appreciate any help that anyone can give me; I am at the end of
> > my rope with this one.
> >
> > James
>
>
--
Julius Davies
Senior Application Developer, Technology Services
Credit Union Central of British Columbia
http://www.cucbc.com/
Tel: 604-730-6385
Cel: 604-868-7571
Fax: 604-737-5910
1441 Creekside Drive
Vancouver, BC
Canada
V6J 4S7
http://juliusdavies.ca/
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]