[Nutch-dev] [jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

Sami Siren (JIRA) Fri, 24 Nov 2006 13:52:35 -0800

    [ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12452522 ]
Sami Siren commented on NUTCH-339:
----------------------------------


The patch applies OK, but I get this error when I try to compile:

compile:
     [echo] Compiling plugin: lib-http
    [javac] Compiling 4 source files to /home/sam/tru/nutch/build/lib-http/classes
    [javac] /home/sam/tru/nutch/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java:551: incompatible types
    [javac] found   : org.apache.nutch.protocol.http.api.RobotRulesParser.RobotRuleSet
    [javac] required: org.apache.nutch.protocol.RobotRules
    [javac]     return robots.getRobotRulesSet(this, url);
    [javac]                                   ^
    [javac] Note: Some input files use unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.
    [javac] 1 error
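
For readers unfamiliar with this javac message: the method at HttpBase.java:551 is declared to return RobotRules, but the expression being returned has type RobotRuleSet, which the compiler does not see as a subtype of RobotRules. The sketch below reproduces the situation with hypothetical stand-in types (Rules, RuleSet, RulesLookup are illustrative names, not the Nutch classes). It assumes one possible resolution, namely that the concrete rule set implements the interface; whether the patch intends that is not stated in this message.

```java
// Hypothetical stand-ins for illustration only; these are NOT the Nutch classes.
interface Rules {
    boolean isAllowed(String path);
}

// If "implements Rules" were removed here, lookup() below would fail to compile
// with an "incompatible types" error like the one in the build log.
class RuleSet implements Rules {
    private final String disallowedPrefix;

    RuleSet(String disallowedPrefix) {
        this.disallowedPrefix = disallowedPrefix;
    }

    @Override
    public boolean isAllowed(String path) {
        // A path is allowed unless it starts with the disallowed prefix.
        return !path.startsWith(disallowedPrefix);
    }
}

class RulesLookup {
    // Declared to return the interface type; returning a RuleSet compiles
    // only because RuleSet is a subtype of Rules.
    static Rules lookup(String disallowedPrefix) {
        return new RuleSet(disallowedPrefix);
    }
}
```

Another way to clear such an error is to change the declared return type instead; which option fits the patch depends on how the interface is meant to be used elsewhere.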



> Refactor nutch to allow fetcher improvements
> --------------------------------------------
>
>                 Key: NUTCH-339
>                 URL: http://issues.apache.org/jira/browse/NUTCH-339
>             Project: Nutch
>          Issue Type: Task
>          Components: fetcher
>    Affects Versions: 0.8
>         Environment: n/a
>            Reporter: Sami Siren
>         Assigned To: Andrzej Bialecki 
>             Fix For: 0.9.0
>
>         Attachments: patch.txt, patch2.txt, patch3.txt, patch4-trunk.txt
>
>
> As I (and Stefan?) see it, there are two major areas where the current
> fetcher could be improved (in terms of speed):
> 1. The politeness code and the way it is implemented are the biggest
> problem of the current fetcher (together with robots.txt handling).
> Simple code changes, such as replacing it with a PriorityQueue-based
> solution, showed very promising results in increased IO.
> 2. Changing the fetcher to use non-blocking IO (this requires a great
> amount of work, as we need to implement the protocols from scratch again).
> I would like to start working towards #1 by first refactoring
> the current code (plugins, actually) in the following way:
> 1. Move robots.txt handling away from the (lib-http) plugin.
> Even if this is related only to http, leaving it in lib-http
> does not allow other kinds of scheduling strategies to be implemented
> (it is hardcoded to fetch robots.txt from the same thread when requesting
> a page from a site from which it hasn't yet tried to load robots.txt).
> 2. Move the politeness code away from the (lib-http) plugin.
> It is really usable outside http, and the current design also limits
> changing the implementation (to a queue-based one).
> Where to move these? My suggestion is the Nutch core; does anybody
> see problems with this?
> These refactoring activities are to be done in such a way that none
> of the current functionality is (at least deliberately) changed,
> leaving room and the possibility to build
> the next generation fetcher(s) without destroying the old one at the same time.
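
To make the PriorityQueue idea from the issue description concrete, here is a minimal sketch of queue-based politeness. It is not Nutch code and not taken from any of the attached patches: HostSlot, PoliteQueue, and the fixed per-host delay are all assumptions made for illustration. Hosts are ordered by the earliest time they may be contacted again, so a fetcher thread can ask for the next eligible host without blocking on a host that is still in its delay window.

```java
import java.util.PriorityQueue;

// Hypothetical sketch, not Nutch code: one entry per host, ordered by the
// earliest time (in milliseconds) at which the host may be fetched again.
class HostSlot implements Comparable<HostSlot> {
    final String host;
    final long nextFetchTime;

    HostSlot(String host, long nextFetchTime) {
        this.host = host;
        this.nextFetchTime = nextFetchTime;
    }

    @Override
    public int compareTo(HostSlot other) {
        return Long.compare(nextFetchTime, other.nextFetchTime);
    }
}

class PoliteQueue {
    private final PriorityQueue<HostSlot> slots = new PriorityQueue<>();
    private final long crawlDelayMs; // assumed fixed delay between requests to one host

    PoliteQueue(long crawlDelayMs) {
        this.crawlDelayMs = crawlDelayMs;
    }

    // Registers a host that may be fetched immediately at time 'now'.
    void addHost(String host, long now) {
        slots.add(new HostSlot(host, now));
    }

    // Returns a host that may be fetched at time 'now', or null if every
    // host is still inside its delay window. The returned host is
    // rescheduled for now + crawlDelayMs.
    String nextHost(long now) {
        HostSlot head = slots.peek();
        if (head == null || head.nextFetchTime > now) {
            return null;
        }
        slots.poll();
        slots.add(new HostSlot(head.host, now + crawlDelayMs));
        return head.host;
    }
}
```

Passing the current time in as a parameter keeps the sketch deterministic and easy to test; a real fetcher would also need per-host URL queues, robots.txt crawl delays, and thread safety, none of which is shown here.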


        

