[ 
http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12453989 ] 
            
Andrzej Bialecki  commented on NUTCH-339:
-----------------------------------------

Ah, we are getting somewhere ... fetchQueues.totalSize=0 means that all input 
entries from the queues have been processed. You are running with 500 threads, 
out of which 296 threads are still processing requests, and 204 threads are 
idle because they don't have anything more to do (queues are empty).

Are you running in parsing mode? Could you please kill -SIGQUIT <pid> to 
produce a thread dump and see why these threads are waiting? I bet they hang on 
regexes or pdf parsing, or a DOMFragment bug ...

Before this happened, when spinWaiting was still close to 0, what was the 
maximum fetching speed? Was it higher/lower/comparable to the regular fetcher? 
What about the CPU usage?

> Refactor nutch to allow fetcher improvements
> --------------------------------------------
>
>                 Key: NUTCH-339
>                 URL: http://issues.apache.org/jira/browse/NUTCH-339
>             Project: Nutch
>          Issue Type: Task
>          Components: fetcher
>    Affects Versions: 0.8
>         Environment: n/a
>            Reporter: Sami Siren
>         Assigned To: Andrzej Bialecki 
>             Fix For: 0.9.0
>
>         Attachments: patch.txt, patch2.txt, patch3.txt, patch4-fixed.txt, 
> patch4-trunk.txt
>
>
> As I (and Stefan?) see it there are two major areas the current fetcher could 
> be
> improved (as in speed)
> 1. Politeness code and how it is implemented is the biggest
> problem of current fetcher(together with robots.txt handling).
> With a simple code changes like replacing it with a PriorityQueue
> based solution showed very promising results in increased IO.
> 2. Changing fetcher to use non blocking io (this requires great amount
> of work as we need to implement the protocols from scratch again).
> I would like to start with working towards #1 by first refactoring
> the current code (plugins actually) in following way:
> 1. Move robots.txt handling away from (lib-http)plugin.
> Even if this is related only to http, leaving it to lib-http
> does not allow other kinds of scheduling strategies to be implemented
> (it is hardcoded to fetch robots.txt from the same thread when requesting
> a page from a site from witch it hasn't tried to load robots.txt)
> 2. Move code for politeness away from (lib-http)plugin
> It is really usable outside http and also the current design limits
> changing of the implementation (to queue based)
> Where to move these, well my suggestion is the nutch core, does anybody
> see problems with this?
> These code refactoring activities are to be done in a way that none
> of the current functionality is (at least deliberately) changed leaving
> current functionality as is thus leaving room and possibility to build
> the next generation fetcher(s) without destroying the old one at same time.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers

Reply via email to