Hi Oleg,

Thanks for the responses. I've filed a Bixo issue to try using the new minimal 
version of HttpClient, and also the unlimited connection manager.

I'll try to test using an existing crawl workflow that hits the top-level pages 
for 60K domains, though that's not exactly the same as a large-scale crawl.
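For what it's worth, here's a rough sketch (plain JDK, no HttpClient internals, all names hypothetical) of the per-route/total atomic-counter scheme I ask about below: optimistically bump both counters, and back them out if either limit would be exceeded, so no global lock is ever held:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not HttpClient API: enforce per-route and total
// connection limits with atomic counters instead of a single pool lock.
class AtomicLimiter {
    private final int maxPerRoute;
    private final int maxTotal;
    private final AtomicInteger total = new AtomicInteger();
    private final ConcurrentMap<String, AtomicInteger> perRoute =
        new ConcurrentHashMap<String, AtomicInteger>();

    AtomicLimiter(int maxPerRoute, int maxTotal) {
        this.maxPerRoute = maxPerRoute;
        this.maxTotal = maxTotal;
    }

    // Returns true if a connection lease for this route is allowed.
    boolean tryAcquire(String route) {
        AtomicInteger routeCount = perRoute.get(route);
        if (routeCount == null) {
            AtomicInteger fresh = new AtomicInteger();
            AtomicInteger existing = perRoute.putIfAbsent(route, fresh);
            routeCount = (existing != null) ? existing : fresh;
        }
        // Optimistically increment, then undo if a limit was exceeded.
        if (routeCount.incrementAndGet() > maxPerRoute) {
            routeCount.decrementAndGet();
            return false;
        }
        if (total.incrementAndGet() > maxTotal) {
            total.decrementAndGet();
            routeCount.decrementAndGet();
            return false;
        }
        return true;
    }

    // Called when a leased connection is released or closed.
    void release(String route) {
        perRoute.get(route).decrementAndGet();
        total.decrementAndGet();
    }
}
```

The optimistic increment-then-undo means threads never block on a shared lock; the counters can transiently over-count inside tryAcquire() before the undo, but admitted threads never exceed the configured limits.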

-- Ken


On Jan 7, 2013, at 2:39am, Oleg Kalnichevski wrote:

> On Sun, 2013-01-06 at 15:48 -0800, Ken Krugler wrote:
>> Hi Oleg,
>> 
>> [snip]
>> 
>>> Ken,
>>> 
>>> You might want to have a look at the latest code in SVN trunk (to be
>>> released as 4.3). Several classes such as the scheme registry that
>>> previously had to be synchronized in order to ensure thread safety have
>>> been replaced with immutable equivalents. There is also now a way to
>>> create HttpClient in a minimal configuration without authentication,
>>> state management (cookies), proxy support and other non-essential
>>> functions.
>> 
>> That sounds interesting - any hints as to how to create this minimal 
>> HttpClient?
>> 
> 
> The new API is not yet final and not properly documented. Presently this
> can be done with HttpClients#createMinimal
> 
> 
>>> These functions are not merely disabled but physically
>>> removed from the processing pipeline, which should result in somewhat
>>> better performance in high thread-contention scenarios, as the only
>>> synchronization point involved in request execution would be the lock of
>>> the connection pool. Minimal HttpClient may be particularly useful for
>>> anonymous web crawling when authentication and state management are not
>>> required.
>>> 
>>> 
>>>> 3. Global lock on connection pool
>>>> 
>>>> Oleg had written:
>>>> 
>>>>> Yes, your observation is correct. The problem is that the connection
>>>>> pool is guarded by a global lock. Naturally if you have 400 threads
>>>>> trying to obtain a connection at about the same time all of them end up
>>>>> contending for one lock. The problem is that I can't think of a
>>>>> different way to ensure the max limits (per route and total) are
>>>>> guaranteed not to be exceeded. If anyone can think of a better algorithm
>>>>> please do let me know. One possibility would be a more lenient
>>>>> implementation, less prone to lock contention, that under stress may
>>>>> occasionally allocate a few more connections than the max
>>>>> limits.
>>>> 
>>>> I don't know if this has been resolved. My work-around from a few years 
>>>> ago was to rely on having multiple Hadoop reducers running on the server 
>>>> (each in its own JVM), where I could then limit each JVM to at most 300 
>>>> connections.
>>>> 
>>> 
>>> I experimented with the idea of lock-less (unlimited) connection manager
>>> but in my tests it did not perform any better than the standard
>>> connection manager.
>> 
>> Previously I'd asked:
>> 
>>> Would it work to go for finer-grained locking, by using atomic counters to 
>>> track & enforce limits on per route/total connections?
>> 
>> Any thoughts on that approach? E.g. have a map from route to atomic counter, 
>> and a single atomic counter for total connections?
>> 
> 
> This may be worthwhile to try. However, in theory this should not
> perform any better than the approach I took with my experiments. The
> main problem is, though, that I do not have a good test framework that
> emulates an environment a web crawler is expected to operate in (and
> have no justification for building one in my spare time). So, this kind
> of effort ideally should be led by an external contributor.
> 
> Oleg

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr