If you're crawling web pages, you need to put a limit on the amount of data any single page can return.
Otherwise you'll eventually run into a site that returns an unbounded amount of data, which will kill your JVM. See SimpleHttpFetcher in Bixo for an example of one way to do this type of limiting (though it's not optimal).

-- Ken

On Feb 10, 2014, at 8:07pm, Li Li <[email protected]> wrote:

> I am using httpclient 4.3 to crawl web pages.
> I start 200 threads and a PoolingHttpClientConnectionManager with
> totalMax 1000 and perHostMax 5.
> I give Java 2GB of memory, and one thread throws an exception (the others
> keep running; this thread is dead):
>
> Exception in thread "Thread-156" java.lang.OutOfMemoryError: Java heap space
>     at org.apache.http.util.ByteArrayBuffer.<init>(ByteArrayBuffer.java:56)
>     at org.apache.http.util.EntityUtils.toByteArray(EntityUtils.java:133)
>     at com.founder.httpclientfetcher.HttpClientFetcher$3.handleResponse(HttpClientFetcher.java:221)
>     at com.founder.httpclientfetcher.HttpClientFetcher$3.handleResponse(HttpClientFetcher.java:211)
>     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:218)
>     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:160)
>     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:136)
>     at com.founder.httpclientfetcher.HttpClientFetcher.httpGet(HttpClientFetcher.java:233)
>     at com.founder.vcfetcher.CrawlWorker.getContent(CrawlWorker.java:198)
>     at com.founder.vcfetcher.CrawlWorker.doWork(CrawlWorker.java:134)
>     at com.founder.vcfetcher.CrawlWorker.run(CrawlWorker.java:231)
>
> Does this mean my code has a memory leak problem?
>
> My code:
>
> public String httpGet(String url) throws Exception {
>     if (!isValid)
>         throw new RuntimeException("not valid now, you should init first");
>     HttpGet httpget = new HttpGet(url);
>
>     // Create a custom response handler
>     ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
>
>         public String handleResponse(final HttpResponse response)
>                 throws ClientProtocolException, IOException {
>             int status = response.getStatusLine().getStatusCode();
>             if (status >= 200 && status < 300) {
>                 HttpEntity entity = response.getEntity();
>                 if (entity == null)
>                     return null;
>
>                 byte[] bytes = EntityUtils.toByteArray(entity);
>                 String charSet = CharsetDetector.getCharset(bytes);
>
>                 return new String(bytes, charSet);
>             } else {
>                 throw new ClientProtocolException(
>                         "Unexpected response status: " + status);
>             }
>         }
>     };
>
>     String responseBody = client.execute(httpget, responseHandler);
>     return responseBody;
> }

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
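
[Editor's note: below is a minimal sketch of the kind of per-page size cap described above. It is not Bixo's SimpleHttpFetcher, just an illustration against the HttpClient 4.3 API used in the quoted code; the class/method names (CappedFetcher, fetchCapped) and the MAX_CONTENT_BYTES value are made up for the example.]

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;

public class CappedFetcher {

    // Hypothetical cap; tune to whatever your crawl can tolerate per page.
    private static final int MAX_CONTENT_BYTES = 1024 * 1024; // 1 MB

    public static byte[] fetchCapped(CloseableHttpClient client, String url)
            throws IOException {
        HttpGet httpget = new HttpGet(url);
        CloseableHttpResponse response = client.execute(httpget);
        try {
            int status = response.getStatusLine().getStatusCode();
            if (status < 200 || status >= 300) {
                throw new ClientProtocolException("Unexpected response status: " + status);
            }

            HttpEntity entity = response.getEntity();
            if (entity == null) {
                return null;
            }

            // Reject early if the server declares an oversized Content-Length.
            long declared = entity.getContentLength();
            if (declared > MAX_CONTENT_BYTES) {
                httpget.abort();
                throw new ClientProtocolException("Declared content too large: " + declared);
            }

            // Stream the body ourselves instead of EntityUtils.toByteArray(),
            // so a missing or bogus Content-Length can never allocate an
            // unbounded buffer on the heap.
            InputStream in = entity.getContent();
            ByteArrayOutputStream out = new ByteArrayOutputStream(8192);
            byte[] buffer = new byte[8192];
            int total = 0;
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
                if (total > MAX_CONTENT_BYTES) {
                    // Abort so the pooled connection is dropped instead of
                    // being drained of a potentially endless body.
                    httpget.abort();
                    throw new ClientProtocolException(
                            "Content exceeded " + MAX_CONTENT_BYTES + " bytes: " + url);
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            response.close();
        }
    }
}

Truncating instead of throwing is also an option if a partial page is still useful to the crawler; the key point is that the read loop, not the server's Content-Length, bounds the buffer that ends up on the heap.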
