Hi Oleg,

Thanks for the responses. I've filed a Bixo issue to try using the new minimal version of HttpClient, and also the unlimited connection manager.
I'll try to test using an existing crawl workflow that hits the top-level pages for 60K domains, though that's not exactly the same as a large-scale crawl.

-- Ken

On Jan 7, 2013, at 2:39am, Oleg Kalnichevski wrote:

> On Sun, 2013-01-06 at 15:48 -0800, Ken Krugler wrote:
>> Hi Oleg,
>>
>> [snip]
>>
>>> Ken,
>>>
>>> You might want to have a look at the latest code in SVN trunk (to be
>>> released as 4.3). Several classes, such as the scheme registry, that
>>> previously had to be synchronized in order to ensure thread safety have
>>> been replaced with immutable equivalents. There is also now a way to
>>> create HttpClient in a minimal configuration without authentication,
>>> state management (cookies), proxy support, and other non-essential
>>> functions.
>>
>> That sounds interesting - any hints as to how to create this minimal
>> HttpClient?
>>
>
> The new API is not yet final and not properly documented. Presently this
> can be done with HttpClients#createMinimal.
>
>>> These functions are not merely disabled but physically
>>> removed from the processing pipeline, which should result in somewhat
>>> better performance in high thread-contention scenarios, as the only
>>> synchronization point involved in request execution would be the lock of
>>> the connection pool. A minimal HttpClient may be particularly useful for
>>> anonymous web crawling, when authentication and state management are not
>>> required.
>>>
>>>> 3. Global lock on connection pool
>>>>
>>>> Oleg had written:
>>>>
>>>>> Yes, your observation is correct. The problem is that the connection
>>>>> pool is guarded by a global lock. Naturally, if you have 400 threads
>>>>> trying to obtain a connection at about the same time, all of them end up
>>>>> contending for one lock. The problem is that I can't think of a
>>>>> different way to ensure the max limits (per route and total) are
>>>>> guaranteed not to be exceeded. If anyone can think of a better
>>>>> algorithm, please do let me know.
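[As a sketch of the minimal configuration Oleg mentions above: in the API that eventually shipped in 4.3, the entry point is HttpClients.createMinimal(), optionally taking a connection manager. The pool limits and everything beyond that one factory call below are my own illustrative assumptions, not from this thread.]

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class MinimalClientSketch {
    public static void main(String[] args) throws Exception {
        // Pool limits are illustrative; tune them for the crawl.
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(400);           // overall connection cap
        cm.setDefaultMaxPerRoute(10);  // per-host cap (crawler politeness)

        // createMinimal() omits auth, cookies, proxy support, etc. from the
        // execution pipeline entirely, leaving the pool lock as the only
        // synchronization point during request execution.
        try (CloseableHttpClient client = HttpClients.createMinimal(cm)) {
            System.out.println("minimal client ready: " + (client != null));
        }
    }
}
```

[Requires httpclient 4.3+ on the classpath.]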
>>>>> What might be a possibility is creating a more
>>>>> lenient implementation, less prone to lock-contention issues, that may
>>>>> under stress occasionally allocate a few more connections than the max
>>>>> limits.
>>>>
>>>> I don't know if this has been resolved. My work-around from a few years
>>>> ago was to rely on having multiple Hadoop reducers running on the server
>>>> (each in their own JVM), where I could then limit each JVM to at most 300
>>>> connections.
>>>
>>> I experimented with the idea of a lock-less (unlimited) connection manager,
>>> but in my tests it did not perform any better than the standard
>>> connection manager.
>>
>> Previously I'd asked:
>>
>>> Would it work to go for finer-grained locking, by using atomic counters to
>>> track & enforce limits on per-route/total connections?
>>
>> Any thoughts on that approach? E.g. have a map from route to atomic counter,
>> and a single atomic counter for total connections?
>>
>
> This may be worthwhile to try. However, in theory this should not
> perform any better than the approach I took with my experiments. The
> main problem, though, is that I do not have a good test framework that
> emulates the environment a web crawler is expected to operate in (and
> have no justification for building one in my spare time). So this kind
> of effort ideally should be led by an external contributor.
>
> Oleg
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
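[Ken's atomic-counter idea could be sketched roughly as below. This is purely hypothetical code, not part of HttpClient; the class and method names are mine. Note that the check-then-increment is not atomic across both counters, so under heavy contention a few connections beyond the caps may be allocated transiently, which is exactly the "lenient" trade-off Oleg describes above.]

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class LenientLimiter {
    private final int maxTotal;
    private final int maxPerRoute;
    private final AtomicInteger total = new AtomicInteger();
    private final ConcurrentHashMap<String, AtomicInteger> perRoute =
            new ConcurrentHashMap<>();

    public LenientLimiter(int maxTotal, int maxPerRoute) {
        this.maxTotal = maxTotal;
        this.maxPerRoute = maxPerRoute;
    }

    /** Try to reserve a connection slot for a route (e.g. a host name).
     *  Soft check: it can race and briefly overshoot the caps, but it
     *  never blocks on a global lock. */
    public boolean tryAcquire(String route) {
        AtomicInteger r = perRoute.computeIfAbsent(route, k -> new AtomicInteger());
        if (total.get() >= maxTotal || r.get() >= maxPerRoute) {
            return false;
        }
        total.incrementAndGet();
        r.incrementAndGet();
        return true;
    }

    /** Return a previously acquired slot. */
    public void release(String route) {
        total.decrementAndGet();
        perRoute.get(route).decrementAndGet();
    }

    public int totalInUse() {
        return total.get();
    }

    public static void main(String[] args) {
        LenientLimiter limiter = new LenientLimiter(2, 1);
        System.out.println(limiter.tryAcquire("example.com")); // true
        System.out.println(limiter.tryAcquire("example.com")); // false: per-route cap
        System.out.println(limiter.tryAcquire("other.org"));   // true
        limiter.release("example.com");
        System.out.println(limiter.totalInUse());              // 1
    }
}
```

[Whether this actually beats the single pool lock would need the kind of crawler-like benchmark Oleg says is missing.]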
