[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements
[ https://issues.apache.org/jira/browse/NUTCH-339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566110#action_12566110 ]

Andrzej Bialecki commented on NUTCH-339:

Fetcher2 was committed long ago, so I'm closing this. If any remaining matters still need to be solved, please create a separate issue.

Refactor nutch to allow fetcher improvements
Key: NUTCH-339
URL: https://issues.apache.org/jira/browse/NUTCH-339
Project: Nutch
Issue Type: Task
Components: fetcher
Affects Versions: 0.8
Environment: n/a
Reporter: Sami Siren
Assignee: Andrzej Bialecki
Fix For: 1.0.0
Attachments: Fetcher2 for .81, patch.txt, patch2.txt, patch3.txt, patch4-fixed.txt, patch4-trunk.txt

As I (and Stefan?) see it, there are two major areas where the current fetcher could be improved (as in speed):

1. The politeness code and how it is implemented is the biggest problem of the current fetcher (together with robots.txt handling). Simple code changes, like replacing it with a PriorityQueue-based solution, showed very promising results in increased IO.
2. Changing the fetcher to use non-blocking IO (this requires a great amount of work, as we need to implement the protocols from scratch again).

I would like to start working towards #1 by first refactoring the current code (plugins, actually) in the following way:

1. Move robots.txt handling away from the (lib-http) plugin. Even if this is related only to http, leaving it in lib-http does not allow other kinds of scheduling strategies to be implemented (it is hardcoded to fetch robots.txt from the same thread when requesting a page from a site from which it hasn't yet tried to load robots.txt).
2. Move the politeness code away from the (lib-http) plugin. It is really usable outside http, and the current design also limits changing the implementation (to a queue-based one).

Where to move these? My suggestion is the nutch core; does anybody see problems with this?

These refactoring activities are to be done in such a way that none of the current functionality is (at least deliberately) changed, thus leaving room to build the next generation fetcher(s) without destroying the old one at the same time.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
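To make the "PriorityQueue-based" politeness idea from the issue description concrete, here is a minimal sketch. It is not code from any of the attached patches: the class names `HostSlot` and `PolitenessScheduler`, the millisecond clock passed in by the caller, and the fixed per-host delay are all assumptions made for illustration. The point it shows is that a fetcher thread only has to look at the head of the queue to learn whether any host may be contacted yet.

```java
import java.util.PriorityQueue;

/** Hypothetical entry: one per host, ordered by the earliest time it may be fetched again. */
final class HostSlot implements Comparable<HostSlot> {
    final String host;
    final long nextFetchTimeMs;

    HostSlot(String host, long nextFetchTimeMs) {
        this.host = host;
        this.nextFetchTimeMs = nextFetchTimeMs;
    }

    @Override
    public int compareTo(HostSlot other) {
        return Long.compare(nextFetchTimeMs, other.nextFetchTimeMs);
    }
}

/** Hypothetical scheduler (not part of Nutch): hands out hosts no faster than delayMs apart. */
final class PolitenessScheduler {
    private final PriorityQueue<HostSlot> queue = new PriorityQueue<>();
    private final long delayMs; // assumed fixed crawl delay applied to every host

    PolitenessScheduler(long delayMs) {
        this.delayMs = delayMs;
    }

    /** Registers a host that may be fetched immediately. */
    synchronized void addHost(String host, long nowMs) {
        queue.add(new HostSlot(host, nowMs));
    }

    /**
     * Returns a host that may be fetched at nowMs and reschedules it
     * delayMs later, or returns null if no host is ready yet.
     */
    synchronized String nextReadyHost(long nowMs) {
        HostSlot head = queue.peek();
        if (head == null || head.nextFetchTimeMs > nowMs) {
            return null; // the earliest host is still inside its delay window
        }
        queue.poll();
        queue.add(new HostSlot(head.host, nowMs + delayMs));
        return head.host;
    }
}
```

In this sketch the delay is counted from the moment a host is handed out; a real fetcher would more likely count it from the moment the request finishes, which needs an extra "finished" callback that is left out here.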
[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements
[ https://issues.apache.org/jira/browse/NUTCH-339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467272 ]

Andrzej Bialecki commented on NUTCH-339:

Well, then this version doesn't work correctly - the performance improvement you see is a result of violating robots.txt and politeness settings.

Refactor nutch to allow fetcher improvements
Key: NUTCH-339
URL: https://issues.apache.org/jira/browse/NUTCH-339
Project: Nutch
Issue Type: Task
Components: fetcher
Affects Versions: 0.8
Environment: n/a
Reporter: Sami Siren
Assigned To: Andrzej Bialecki
Fix For: 0.9.0
Attachments: Fetcher2 for .81, patch.txt, patch2.txt, patch3.txt, patch4-fixed.txt, patch4-trunk.txt
[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12453820 ]

Andrzej Bialecki commented on NUTCH-339:

This looks weird; if anything, it seems to be caused by a bug in Hadoop. Are you able to run readseg -dump on this fetchlist?

Another idea: do you have any "lease expired" messages in your log around that time? It looks like the underlying input stream may have been closed.

Refactor nutch to allow fetcher improvements
Key: NUTCH-339
URL: http://issues.apache.org/jira/browse/NUTCH-339
Project: Nutch
Issue Type: Task
Components: fetcher
Affects Versions: 0.8
Environment: n/a
Reporter: Sami Siren
Assigned To: Andrzej Bialecki
Fix For: 0.9.0
Attachments: patch.txt, patch2.txt, patch3.txt, patch4-fixed.txt, patch4-trunk.txt
[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12453975 ]

Sami Siren commented on NUTCH-339:

Perhaps that exception is just a consequence of something else, like this:

2006-11-27 07:35:09,434 INFO fetcher.Fetcher2 - -activeThreads=296, spinWaiting=204, fetchQueues.totalSize=0
2006-11-27 07:35:09,434 WARN fetcher.Fetcher2 - Aborting with 296 hung threads.
2006-11-27 07:35:09,434 INFO mapred.LocalJobRunner - 3821 pages, 207 errors, 5.5 pages/s, 780 kb/s,

and the next log entry is:

2006-11-27 07:35:15,443 INFO mapred.JobClient - map 100% reduce 0%

Refactor nutch to allow fetcher improvements
Key: NUTCH-339
URL: http://issues.apache.org/jira/browse/NUTCH-339
Project: Nutch
Issue Type: Task
Components: fetcher
Affects Versions: 0.8
Environment: n/a
Reporter: Sami Siren
Assigned To: Andrzej Bialecki
Fix For: 0.9.0
Attachments: patch.txt, patch2.txt, patch3.txt, patch4-fixed.txt, patch4-trunk.txt
[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12454045 ]

Sami Siren commented on NUTCH-339:

I am running with 300 threads in parsing mode, and the thread dump shows:

191 threads waiting on condition
  at java.lang.Thread.sleep(Native Method)
  at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:422)

71 waiting for monitor entry
  at org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.getFetchItem(Fetcher2.java:306)
  - waiting to lock 0x52fa7328 (a org.apache.nutch.fetcher.Fetcher2$FetchItemQueues)
  at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:415)

The rest are runnable.

CPU usage starts low but very quickly ramps up, and the machine becomes almost unresponsive. Fetching speed is low because all the CPU goes to something else.

Refactor nutch to allow fetcher improvements
Key: NUTCH-339
URL: http://issues.apache.org/jira/browse/NUTCH-339
Project: Nutch
Issue Type: Task
Components: fetcher
Affects Versions: 0.8
Environment: n/a
Reporter: Sami Siren
Assigned To: Andrzej Bialecki
Fix For: 0.9.0
Attachments: patch.txt, patch2.txt, patch3.txt, patch4-fixed.txt, patch4-trunk.txt
[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12452522 ]

Sami Siren commented on NUTCH-339:

The patch applies OK, but there's this error when I try to compile:

compile:
[echo] Compiling plugin: lib-http
[javac] Compiling 4 source files to /home/sam/tru/nutch/build/lib-http/classes
[javac] /home/sam/tru/nutch/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java:551: incompatible types
[javac] found : org.apache.nutch.protocol.http.api.RobotRulesParser.RobotRuleSet
[javac] required: org.apache.nutch.protocol.RobotRules
[javac] return robots.getRobotRulesSet(this, url);
[javac] ^
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 1 error

Refactor nutch to allow fetcher improvements
Key: NUTCH-339
URL: http://issues.apache.org/jira/browse/NUTCH-339
Project: Nutch
Issue Type: Task
Components: fetcher
Affects Versions: 0.8
Environment: n/a
Reporter: Sami Siren
Assigned To: Andrzej Bialecki
Fix For: 0.9.0
Attachments: patch.txt, patch2.txt, patch3.txt, patch4-trunk.txt
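For readers unfamiliar with the "incompatible types" error reported for HttpBase.java above, the sketch below shows the general shape of the problem. It is not the actual Nutch source: `RobotRules` and `RobotRuleSet` here are drastically simplified stand-ins that only borrow the names from the javac output, and `RobotRulesSource` and the `isAllowed(String)` signature are hypothetical. A method declared to return an interface can only return a concrete class that is assignable to that interface; if the `implements` clause below were missing, javac would reject the `return` with the same kind of message.

```java
/** Simplified stand-in for the core interface named in the "required:" line. */
interface RobotRules {
    boolean isAllowed(String path);
}

/**
 * Simplified stand-in for the parser's rule set named in the "found:" line.
 * The "implements RobotRules" clause is what makes the return statement
 * in RobotRulesSource compile.
 */
final class RobotRuleSet implements RobotRules {
    private final String disallowedPrefix;

    RobotRuleSet(String disallowedPrefix) {
        this.disallowedPrefix = disallowedPrefix;
    }

    @Override
    public boolean isAllowed(String path) {
        return !path.startsWith(disallowedPrefix);
    }
}

/** Hypothetical caller whose return type is the interface, not the concrete class. */
final class RobotRulesSource {
    static RobotRules getRobotRules() {
        // Compiles only because RobotRuleSet is assignable to RobotRules.
        return new RobotRuleSet("/private");
    }
}
```

Whether the patch was missing such an `implements` clause or simply missing a file is not something the error output alone settles; the sketch only illustrates what the compiler is complaining about.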
Re: email to jira comments (WAS Re: [jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements)
Sami Siren wrote:
> looks like somebody just enabled email-to-jira-comments-feature. I was just wondering would it be good to use this feature more widely.

I think it would be good. That way mailing list discussion would be logged to the bug as well.

> This could be achieved by removing the replyto header from messages coming from jira so that replies get sent to [EMAIL PROTECTED] (i am assuming that is possible). So whenever somebody just hits reply from email client and writes the comment it would get automatically attached to correct issue as a comment.

I sent a message to [EMAIL PROTECTED] this morning asking about this. If it's possible, and no one objects, I will request it for the Nutch mailing lists.

Doug
Re: email to jira comments (WAS Re: [jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements)
+1

On 10/16/06, Doug Cutting [EMAIL PROTECTED] wrote:
> Sami Siren wrote:
>> looks like somebody just enabled email-to-jira-comments-feature. I was just wondering would it be good to use this feature more widely.
>
> I think it would be good. That way mailing list discussion would be logged to the bug as well.
>
>> This could be achieved by removing the replyto header from messages coming from jira so that replies get sent to [EMAIL PROTECTED] (i am assuming that is possible). So whenever somebody just hits reply from email client and writes the comment it would get automatically attached to correct issue as a comment.
>
> I sent a message to [EMAIL PROTECTED] this morning asking about this. If it's possible, and no one objects, I will request it for the Nutch mailing lists.
>
> Doug
[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12442195 ]

Sami Siren commented on NUTCH-339:

[[ Old comment, sent by email on Sun, 06 Aug 2006 08:06:13 +0300 ]]

The original Fetcher is no longer being polite? Other than that, both seem to be working OK based on a very small crawl I did.

Some thoughts about the design (or perhaps more about how I did it :)

- The FetchQueue implementation could be in its own class (file).
- I also moved the class that handles robots parsing to core.
- I used the existing FibonacciHeap.java (in org.apache.nutch.util) to back the fetching queue. The priority I used was the time (in seconds) at which one can again fetch from that particular site. You can then use queue.peek to see the highest-priority site (the one that should be fetched next), check its time, and if needed read more records from the recordreader.
- I created a new object, Site, that I queued. Those objects contained a list of URLs to be fetched from that site and the real time (in seconds) when one can fetch again.
- The queue hid the recordreader, so fetcher threads only had to deal with this queue.
- I didn't add any special method for robots.rules to the Protocol interface (it's just like any other resource that's going to be fetched). Instead, when an http URL was read from the recordreader for a site whose robots.txt had not been seen earlier, robots.txt was put in as a normal resource for that site (Site), to be fetched first. When that resource was fetched, it was advertised to the FetchingQueue, which then parsed it and stored it in the FetchSite object.
- Also, by using this FetchSite object I could easily implement some useful methods, like blocking all URLs from a site (for example when the hostname cannot be resolved, or connections constantly time out, etc.).

Attached you can find a simple drawing I did earlier of the new fetcher I had in mind, just for reference if my words are confusing :)

--
Sami Siren

[demime 1.01d removed an attachment of type image/png which had a name of fetcher.png]

Refactor nutch to allow fetcher improvements
Key: NUTCH-339
URL: http://issues.apache.org/jira/browse/NUTCH-339
Project: Nutch
Issue Type: Task
Components: fetcher
Affects Versions: 0.8
Environment: n/a
Reporter: Sami Siren
Assigned To: Sami Siren
Fix For: 0.9.0
Attachments: patch.txt, patch2.txt, patch3.txt
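The per-site object described in the 06 Aug 2006 comment above (a list of URLs for one site, robots.txt queued ahead of the first page, and a way to block the whole site) could look roughly like the following. This is a reconstruction from the prose only, not the code Sami wrote: the method names `add`, `poll` and `blockAll`, the `http://` scheme used to build the robots.txt URL, and the public `nextFetchTimeSec` field are assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for the "FetchSite" object described in the comment. */
final class FetchSite {
    private final String host;
    private final Deque<String> urls = new ArrayDeque<>();
    private boolean robotsQueued = false;
    private boolean blocked = false;

    /** Earliest time (in seconds) at which this site may be fetched again. */
    long nextFetchTimeSec = 0;

    FetchSite(String host) {
        this.host = host;
    }

    /** Queues a URL; the first time, the site's robots.txt is queued ahead of it. */
    void add(String url) {
        if (blocked) {
            return; // a blocked site silently drops new URLs
        }
        if (!robotsQueued) {
            // robots.txt is treated as an ordinary resource that happens to go first
            urls.addFirst("http://" + host + "/robots.txt");
            robotsQueued = true;
        }
        urls.addLast(url);
    }

    /** Returns the next URL to fetch for this site, or null if none (or blocked). */
    String poll() {
        return blocked ? null : urls.poll();
    }

    /** Drops every pending URL, e.g. after repeated timeouts or a DNS failure. */
    void blockAll() {
        blocked = true;
        urls.clear();
    }
}
```

A queue ordered by `nextFetchTimeSec` would hold these objects, so fetcher threads never touch the recordreader directly.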
email to jira comments (WAS Re: [jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements)
Sami Siren (JIRA) wrote:
> [[ Old comment, sent by email on Sun, 06 Aug 2006 08:06:13 +0300 ]]

looks like somebody just enabled email-to-jira-comments-feature. I was just wondering would it be good to use this feature more widely.

This could be achieved by removing the replyto header from messages coming from jira so that replies get sent to [EMAIL PROTECTED] (i am assuming that is possible). So whenever somebody just hits reply from email client and writes the comment it would get automatically attached to correct issue as a comment.

--
Sami Siren
[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12433354 ]

Doğacan Güney commented on NUTCH-339:

I have made a few changes to Andrzej's latest patch. The biggest change is that BLOCKED_ADDR_QUEUE is now a priority queue and cleanExpiredServerBlocks should block threads a lot less. I am attaching this as patch3.txt.

Refactor nutch to allow fetcher improvements
Key: NUTCH-339
URL: http://issues.apache.org/jira/browse/NUTCH-339
Project: Nutch
Issue Type: Task
Components: fetcher
Affects Versions: 0.8
Environment: n/a
Reporter: Sami Siren
Assigned To: Sami Siren
Fix For: 0.9.0
Attachments: patch.txt, patch2.txt, patch3.txt
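Why a priority queue helps with the BLOCKED_ADDR_QUEUE / cleanExpiredServerBlocks change mentioned in the patch3.txt comment can be sketched as follows. This is not the code from patch3.txt: `BlockedAddr`, `BlockedAddrs` and their methods are hypothetical names, and expiry times are plain millisecond values supplied by the caller. When blocks are ordered by expiry time, cleanup only has to pop entries from the head until it meets one that has not expired, so the lock is held for work proportional to the number of expired entries rather than to the total number of blocked addresses.

```java
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

/** Hypothetical entry: a blocked address and the time its block expires. */
final class BlockedAddr implements Comparable<BlockedAddr> {
    final String addr;
    final long expiresAtMs;

    BlockedAddr(String addr, long expiresAtMs) {
        this.addr = addr;
        this.expiresAtMs = expiresAtMs;
    }

    @Override
    public int compareTo(BlockedAddr other) {
        return Long.compare(expiresAtMs, other.expiresAtMs);
    }
}

/** Hypothetical container (not the Nutch implementation). */
final class BlockedAddrs {
    private final PriorityQueue<BlockedAddr> queue = new PriorityQueue<>();
    private final Set<String> blocked = new HashSet<>();

    /** Blocks an address until expiresAtMs; an already blocked address is left unchanged. */
    synchronized void block(String addr, long expiresAtMs) {
        if (blocked.add(addr)) {
            queue.add(new BlockedAddr(addr, expiresAtMs));
        }
    }

    /** Removes every block that has expired by nowMs and returns how many were removed. */
    synchronized int cleanExpired(long nowMs) {
        int removed = 0;
        // The head is always the block that expires soonest, so stop at the first live one.
        while (!queue.isEmpty() && queue.peek().expiresAtMs <= nowMs) {
            blocked.remove(queue.poll().addr);
            removed++;
        }
        return removed;
    }

    synchronized boolean isBlocked(String addr) {
        return blocked.contains(addr);
    }
}
```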
[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12433185 ]

Sami Siren commented on NUTCH-339:

Andrzej, are you still working with this or should I proceed as I originally planned ;)

Refactor nutch to allow fetcher improvements
Key: NUTCH-339
URL: http://issues.apache.org/jira/browse/NUTCH-339
Project: Nutch
Issue Type: Task
Components: fetcher
Affects Versions: 0.8
Environment: n/a
Reporter: Sami Siren
Assigned To: Sami Siren
Fix For: 0.9.0
Attachments: patch.txt, patch2.txt
[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12425763 ]

Andrzej Bialecki commented on NUTCH-339:

Great minds think alike ... ;) I started doing exactly this, and so far my patches seem to follow all requirements. Here's my work-in-progress patch. Warning: not tested!

Refactor nutch to allow fetcher improvements
Key: NUTCH-339
URL: http://issues.apache.org/jira/browse/NUTCH-339
Project: Nutch
Issue Type: Task
Components: fetcher
Affects Versions: 0.9
Environment: n/a
Reporter: Sami Siren
Assigned To: Sami Siren
[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12425782 ]

Sami Siren commented on NUTCH-339:

I am not sure what you are referring to by this 3-4 sec, but yes, I agree there are more aspects to optimize in the fetcher. What I was concerned about first was the fetching IO speed, which was getting ridiculously low (not quite sure when this happened). We should open more than one ticket to track these separate aspects. And for general discussion, the mailing lists are perhaps the best place.

Refactor nutch to allow fetcher improvements
Key: NUTCH-339
URL: http://issues.apache.org/jira/browse/NUTCH-339
Project: Nutch
Issue Type: Task
Components: fetcher
Affects Versions: 0.9
Environment: n/a
Reporter: Sami Siren
Assigned To: Sami Siren
Attachments: patch.txt