If you're crawling web pages, you need to cap the amount of data you'll 
accept from any one page.

Otherwise you'll eventually hit a site that returns an unbounded amount of 
data, and buffering all of it in memory will exhaust your JVM's heap.

See SimpleHttpFetcher in Bixo for an example of one way to do this type of 
limiting (though not optimal).
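
To make the idea concrete, here's a rough sketch of a capped read using only 
the JDK. This is not the Bixo code; the class name, method name, and the 64KB 
limit are all made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of HttpClient or Bixo.
public class BoundedReader {

    /**
     * Reads at most maxBytes from the stream and returns them.
     * Anything past the cap is left unread, so the page is truncated.
     */
    public static byte[] readAtMost(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int remaining = maxBytes;
        while (remaining > 0) {
            // Never ask for more than what is left under the cap.
            int n = in.read(buffer, 0, Math.min(buffer.length, remaining));
            if (n == -1) {
                break; // end of stream before reaching the cap
            }
            out.write(buffer, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Simulate a 100,000-byte page and cap the read at 64KB (arbitrary limit).
        byte[] page = new byte[100_000];
        byte[] got = readAtMost(new ByteArrayInputStream(page), 64 * 1024);
        System.out.println(got.length); // prints 65536
    }
}
```

In your handler you'd pass entity.getContent() to something like this instead 
of calling EntityUtils.toByteArray(entity). You'll probably also want to abort 
the request (httpget.abort()) once you hit the cap, since otherwise the client 
may still try to drain the rest of the response when it releases the 
connection.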

-- Ken


On Feb 10, 2014, at 8:07pm, Li Li <[email protected]> wrote:

> I am using HttpClient 4.3 to crawl web pages.
> I start 200 threads and use a PoolingHttpClientConnectionManager with
> totalMax 1000 and perHostMax 5.
> I give Java 2GB of memory, and one thread throws an exception (the others
> are still running; this thread is dead):
> 
> Exception in thread "Thread-156" java.lang.OutOfMemoryError: Java heap space
>        at org.apache.http.util.ByteArrayBuffer.<init>(ByteArrayBuffer.java:56)
>        at org.apache.http.util.EntityUtils.toByteArray(EntityUtils.java:133)
>        at com.founder.httpclientfetcher.HttpClientFetcher$3.handleResponse(HttpClientFetcher.java:221)
>        at com.founder.httpclientfetcher.HttpClientFetcher$3.handleResponse(HttpClientFetcher.java:211)
>        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:218)
>        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:160)
>        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:136)
>        at com.founder.httpclientfetcher.HttpClientFetcher.httpGet(HttpClientFetcher.java:233)
>        at com.founder.vcfetcher.CrawlWorker.getContent(CrawlWorker.java:198)
>        at com.founder.vcfetcher.CrawlWorker.doWork(CrawlWorker.java:134)
>        at com.founder.vcfetcher.CrawlWorker.run(CrawlWorker.java:231)
> 
> Does this mean my code has a memory leak problem?
> 
> My code:
> public String httpGet(String url) throws Exception {
>     if (!isValid)
>         throw new RuntimeException("not valid now, you should init first");
>     HttpGet httpget = new HttpGet(url);
> 
>     // Create a custom response handler
>     ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
> 
>         public String handleResponse(final HttpResponse response)
>                 throws ClientProtocolException, IOException {
>             int status = response.getStatusLine().getStatusCode();
>             if (status >= 200 && status < 300) {
>                 HttpEntity entity = response.getEntity();
>                 if (entity == null)
>                     return null;
> 
>                 byte[] bytes = EntityUtils.toByteArray(entity);
>                 String charSet = CharsetDetector.getCharset(bytes);
> 
>                 return new String(bytes, charSet);
>             } else {
>                 throw new ClientProtocolException(
>                         "Unexpected response status: " + status);
>             }
>         }
> 
>     };
> 
>     String responseBody = client.execute(httpget, responseHandler);
>     return responseBody;
> }
> 

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






