On Mon, 2014-02-10 at 20:57 -0800, Ken Krugler wrote:
> If you're crawling web pages, you need to limit the amount of data any
> page can return.
>
> Otherwise you'll eventually run into a site that returns an unbounded amount
> of data, which will kill your JVM.
>
> See SimpleHttpFetcher in Bixo for an example of one way to do this kind
> of limiting (though it's not optimal).
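>
> For example, a minimal sketch of a bounded read (not Bixo's actual
> code; the 8 KB buffer and the silent truncation are just one possible
> policy):
>
> import java.io.ByteArrayOutputStream;
> import java.io.IOException;
> import java.io.InputStream;
>
> import org.apache.http.HttpEntity;
>
> // Read at most maxBytes from the entity; bytes past the limit are dropped.
> static byte[] toBoundedByteArray(HttpEntity entity, long maxBytes)
>         throws IOException {
>     ByteArrayOutputStream out = new ByteArrayOutputStream();
>     InputStream in = entity.getContent();
>     try {
>         byte[] buf = new byte[8192];
>         int n;
>         while ((n = in.read(buf)) != -1) {
>             long remaining = maxBytes - out.size();
>             if (n >= remaining) {
>                 out.write(buf, 0, (int) remaining);
>                 break; // limit reached, stop reading
>             }
>             out.write(buf, 0, n);
>         }
>     } finally {
>         in.close();
>     }
>     return out.toByteArray();
> }
>
> Even better is to abort the request once the limit is hit, so the
> connection doesn't sit there draining a huge body.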
>
> -- Ken
>
>
> On Feb 10, 2014, at 8:07pm, Li Li <[email protected]> wrote:
>
> > I am using HttpClient 4.3 to crawl web pages.
> > I start 200 threads and a PoolingHttpClientConnectionManager with
> > totalMax 1000 and perHostMax 5.
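> >
> > (A simplified sketch of the client setup; the field names in my real
> > code differ:)
> >
> > import org.apache.http.impl.client.CloseableHttpClient;
> > import org.apache.http.impl.client.HttpClients;
> > import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
> >
> > static CloseableHttpClient newClient() {
> >     // pool shared by all 200 worker threads
> >     PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
> >     cm.setMaxTotal(1000);        // "totalMax"
> >     cm.setDefaultMaxPerRoute(5); // "perHostMax"
> >     return HttpClients.custom().setConnectionManager(cm).build();
> > }
> >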
> > I give Java 2GB of memory, and one thread throws an exception (the
> > others keep running; this thread is dead):
> >
> > Exception in thread "Thread-156" java.lang.OutOfMemoryError: Java heap space
> >     at org.apache.http.util.ByteArrayBuffer.<init>(ByteArrayBuffer.java:56)
> >     at org.apache.http.util.EntityUtils.toByteArray(EntityUtils.java:133)
Moreover, buffering the response content in memory (either as a byte
array or a string) sounds like a really bad idea to me.
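
For instance, instead of EntityUtils#toByteArray (the frame just above)
the content can be processed incrementally off the stream. A minimal
sketch, writing to disk purely as an illustration (the path handling is
hypothetical):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.http.HttpEntity;

// Copy the entity to a file in fixed-size chunks; heap usage stays at
// one 8 KB buffer per thread, regardless of the response size.
static void streamToFile(HttpEntity entity, String path) throws IOException {
    InputStream in = entity.getContent();
    try {
        OutputStream out = new FileOutputStream(path);
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } finally {
            out.close();
        }
    } finally {
        in.close();
    }
}

Heap usage then stays bounded no matter how large the response is.
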
Oleg
> >     at com.founder.httpclientfetcher.HttpClientFetcher$3.handleResponse(HttpClientFetcher.java:221)
> >     at com.founder.httpclientfetcher.HttpClientFetcher$3.handleResponse(HttpClientFetcher.java:211)
> >     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:218)
> >     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:160)
> >     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:136)
> >     at com.founder.httpclientfetcher.HttpClientFetcher.httpGet(HttpClientFetcher.java:233)
> >     at com.founder.vcfetcher.CrawlWorker.getContent(CrawlWorker.java:198)
> >     at com.founder.vcfetcher.CrawlWorker.doWork(CrawlWorker.java:134)
> >     at com.founder.vcfetcher.CrawlWorker.run(CrawlWorker.java:231)
> >
> > Does this mean my code has a memory leak problem?
> >
> > My code:
> > public String httpGet(String url) throws Exception {
> >     if (!isValid)
> >         throw new RuntimeException("not valid now, you should init first");
> >     HttpGet httpget = new HttpGet(url);
> >
> >     // Create a custom response handler
> >     ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
> >
> >         public String handleResponse(final HttpResponse response)
> >                 throws ClientProtocolException, IOException {
> >             int status = response.getStatusLine().getStatusCode();
> >             if (status >= 200 && status < 300) {
> >                 HttpEntity entity = response.getEntity();
> >                 if (entity == null)
> >                     return null;
> >
> >                 byte[] bytes = EntityUtils.toByteArray(entity);
> >                 String charSet = CharsetDetector.getCharset(bytes);
> >
> >                 return new String(bytes, charSet);
> >             } else {
> >                 throw new ClientProtocolException(
> >                         "Unexpected response status: " + status);
> >             }
> >         }
> >
> >     };
> >
> >     String responseBody = client.execute(httpget, responseHandler);
> >     return responseBody;
> > }
> >
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]