Hi, James,

If you are downloading a huge webpage, you probably can't keep it in
memory as a String, a byte[], or anything else.

Consider saving the webpage to disk and dealing with it as a local
file from then on.  Here's how I would save a webpage to disk,
reading it in 4KB pieces:


InputStream in = null;
FileOutputStream out = null;
try
{
  in = method.getResponseBodyAsStream();
  out = new FileOutputStream( "/file/to/write" );
  byte[] buf = new byte[ 4096 ];
  int bytesRead;
  // Read up to 4KB from the stream and write each chunk straight to
  // disk, so at most one buffer's worth is ever held in memory.
  while ( ( bytesRead = in.read( buf ) ) != -1 )
  {
    out.write( buf, 0, bytesRead );
  }
}
finally
{
  if ( in != null )
  {
    try
    {
      in.close();
    }
    catch ( IOException ioe )
    {
      ioe.printStackTrace();
    }
  }
  if ( out != null )
  {
    try
    {
      out.close();
    }
    catch ( IOException ioe )
    {
      ioe.printStackTrace();
    }
  }
  // Always give the connection back to the connection manager.
  method.releaseConnection();
}
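
If you really do need the page in memory as a String (e.g. your
crawler parses it), at least cap how much you are willing to buffer,
so one runaway site can't take down the JVM.  Here's a minimal sketch
along those lines (untested; the 1 MB cap and the ISO-8859-1 charset
are assumptions you would tune for your crawler):

// Assumes java.io.* and org.apache.commons.httpclient.HttpMethod
// are imported.
private static final int MAX_BYTES = 1024 * 1024;  // 1 MB cap; an assumption

private String readCapped( HttpMethod method ) throws IOException
{
  InputStream in = method.getResponseBodyAsStream();
  if ( in == null )
  {
    return null;
  }
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  byte[] buf = new byte[ 4096 ];
  int bytesRead;
  while ( ( bytesRead = in.read( buf ) ) != -1 )
  {
    out.write( buf, 0, bytesRead );
    if ( out.size() > MAX_BYTES )
    {
      // Suspiciously large page; give up rather than run out of memory.
      return null;
    }
  }
  // ISO-8859-1 is the HTTP default charset; an assumption, since the
  // real charset comes from the Content-Type header.
  return out.toString( "ISO-8859-1" );
}

The caller is still responsible for releasing the connection in a
finally block, same as above.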




On Tue, 2006-03-14 at 11:57 +0100, Oleg Kalnichevski wrote:
> On Tue, 2006-03-14 at 01:52 -0500, James Ostheimer wrote: 
> > Hi-
> > 
> > I am using HttpClient in a multi-threaded webcrawler application.  I am
> > using the MultiThreadedHttpConnectionManager in conjunction with 300
> > threads that download pages from various sites.
> > 
> > Problem is that I am running out of memory shortly after the process
> > begins.  I used JProfiler to analyze the memory stacks and it points to
> > 
> >   76.2% - 233,587 kB - 6,626 alloc.
> >   org.apache.commons.httpclient.HttpMethod.getResponseBodyAsString
> > 
> > as the culprit (at most there should be a little over 300 allocations,
> > as there are 300 threads operating at once).  Other relevant
> > information: I am on a Windows XP Pro platform using the Sun JRE that
> > came with jdk1.5.0_06, and I am using commons-httpclient-3.0.jar.
> > 
> 
> James,
> 
> There's no memory leak in HttpClient.  Just do not use the
> HttpMethod#getResponseBodyAsString() method, which is not intended for
> retrieval of response entities of arbitrary length, because it buffers
> the entire response content in memory in order to convert it to a
> String.  If your crawler hits a site that generates an endless stream
> of garbage, the JVM is bound to run out of memory.
> 
> Use getResponseBodyAsStream() instead.
> 
> Hope this helps
> 
> Oleg
> 
> > Here is the code where I initialize the HttpClient:
> > 
> > private HttpClient httpClient; 
> >  
> >  public CrawlerControllerThread(QueueThread qt, MessageReceiver receiver,
> >    int maxThreads, String flag, boolean filter, String filterString,
> >    String dbType) {
> >   this.qt = qt;
> >   this.receiver = receiver;
> >   this.maxThreads = maxThreads;
> >   this.flag = flag;
> >   this.filter = filter;
> >   this.filterString = filterString;
> >   this.dbType = dbType;
> >   threads = new ArrayList();
> >   lastStatus = new HashMap();
> >   
> >   HttpConnectionManagerParams htcmp = new HttpConnectionManagerParams();
> >   htcmp.setMaxTotalConnections(maxThreads);
> >   htcmp.setDefaultMaxConnectionsPerHost(10);
> >   htcmp.setSoTimeout(5000);
> >   MultiThreadedHttpConnectionManager mtcm = new MultiThreadedHttpConnectionManager();
> >   mtcm.setParams(htcmp);
> >   httpClient = new HttpClient(mtcm);
> >   
> >   
> >  }
> > 
> > The client reference to httpClient is then passed to all the crawling 
> > threads where it is used as follows:
> > 
> > private String getPageApache(URL pageURL, ArrayList unProcessed) {
> >   SaveURL saveURL = new SaveURL();
> >   HttpMethod method = null;
> >   HttpURLConnection urlConnection = null;
> >   String rawPage = "";
> >   try {
> >    method = new GetMethod(pageURL.toExternalForm());
> >    method.setFollowRedirects(true);
> >    method.setRequestHeader("Content-type", "text/html");
> >    int statusCode = httpClient.executeMethod(method);
> > //   urlConnection = new HttpURLConnection(method,
> > //     pageURL);
> >    logger.debug("Requesting: "+pageURL.toExternalForm());
> > 
> >    
> >    rawPage = method.getResponseBodyAsString();
> >    //rawPage = saveURL.getURL(urlConnection);
> >    if(rawPage == null){
> >     unProcessed.add(pageURL);
> >    } 
> >    return rawPage;
> >   } catch (IllegalArgumentException e) {
> >    //e.printStackTrace();
> >   } catch (HttpException e) {
> >    //e.printStackTrace();
> >   } catch (IOException e) {
> >    unProcessed.add(pageURL);
> >    //e.printStackTrace();
> >   } finally {
> >    if(method != null) {
> >     method.releaseConnection();
> >    }
> >    try {
> >     if(urlConnection != null) {
> >      if(urlConnection.getInputStream() != null) {
> >       urlConnection.getInputStream().close();
> >      }
> >     }
> >    } catch (IOException e) {
> >     // TODO Auto-generated catch block
> >     e.printStackTrace();
> >    }
> >    urlConnection = null;
> >    method = null;
> >   }
> >   return null;
> >  }
> > 
> > As you can see, I release the connection in the finally block, so that
> > should not be the problem.  After getPageApache returns, the page
> > String is processed and then set to null for garbage collection.  I
> > have been playing with this (closing streams, using HttpURLConnection
> > instead of GetMethod) and I cannot find the answer.  Indeed, it seems
> > the answer does not lie in my code.
> > 
> > I greatly appreciate any help that anyone can give me; I am at the end
> > of my rope with this one.
> > 
> > James
> 
> 
-- 
Julius Davies
Senior Application Developer, Technology Services
Credit Union Central of British Columbia
http://www.cucbc.com/
Tel: 604-730-6385
Cel: 604-868-7571
Fax: 604-737-5910

1441 Creekside Drive
Vancouver, BC
Canada
V6J 4S7

http://juliusdavies.ca/

