[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2008-02-06 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566110#action_12566110
 ] 

Andrzej Bialecki  commented on NUTCH-339:
-

Fetcher2 was committed long ago, so I'm closing this. If any remaining 
matters still need to be resolved, please create a separate issue.

 Refactor nutch to allow fetcher improvements
 

 Key: NUTCH-339
 URL: https://issues.apache.org/jira/browse/NUTCH-339
 Project: Nutch
  Issue Type: Task
  Components: fetcher
Affects Versions: 0.8
 Environment: n/a
Reporter: Sami Siren
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: Fetcher2 for .81, patch.txt, patch2.txt, patch3.txt, 
 patch4-fixed.txt, patch4-trunk.txt


 As I (and Stefan?) see it, there are two major areas where the current fetcher
 could be improved (in terms of speed):
 1. The politeness code and how it is implemented is the biggest
 problem of the current fetcher (together with robots.txt handling).
 A simple code change, replacing it with a PriorityQueue-based
 solution, showed very promising results in increased IO.
 2. Changing the fetcher to use non-blocking IO (this requires a great amount
 of work, as we need to implement the protocols from scratch again).
 I would like to start working towards #1 by first refactoring
 the current code (plugins, actually) in the following way:
 1. Move robots.txt handling away from the (lib-http) plugin.
 Even if this is related only to HTTP, leaving it in lib-http
 does not allow other kinds of scheduling strategies to be implemented
 (it is hardcoded to fetch robots.txt from the same thread when requesting
 a page from a site from which it hasn't yet tried to load robots.txt).
 2. Move the politeness code away from the (lib-http) plugin.
 It is really usable outside HTTP, and the current design also limits
 changing the implementation (to a queue-based one).
 Where to move these? My suggestion is the Nutch core. Does anybody
 see problems with this?
 These refactoring activities are to be done in such a way that none
 of the current functionality is (at least deliberately) changed. That
 leaves room to build the next-generation fetcher(s) without destroying
 the old one at the same time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2007-01-24 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467272
 ] 

Andrzej Bialecki  commented on NUTCH-339:
-

Well, then this version doesn't work correctly - the performance improvement 
you see is a result of violating robots.txt and politeness settings.

 Refactor nutch to allow fetcher improvements
 

 Key: NUTCH-339
 URL: https://issues.apache.org/jira/browse/NUTCH-339
 Project: Nutch
  Issue Type: Task
  Components: fetcher
Affects Versions: 0.8
 Environment: n/a
Reporter: Sami Siren
 Assigned To: Andrzej Bialecki 
 Fix For: 0.9.0

 Attachments: Fetcher2 for .81, patch.txt, patch2.txt, patch3.txt, 
 patch4-fixed.txt, patch4-trunk.txt





[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-11-28 Thread Andrzej Bialecki (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12453820 ] 

Andrzej Bialecki  commented on NUTCH-339:
-

This looks weird; if anything, it seems to be caused by a bug in Hadoop. Are 
you able to run readseg -dump on this fetchlist?

Another idea: do you have any "lease expired" messages in your log from around 
that time? It looks like the underlying input stream may have been closed.

 Refactor nutch to allow fetcher improvements
 

 Key: NUTCH-339
 URL: http://issues.apache.org/jira/browse/NUTCH-339
 Project: Nutch
  Issue Type: Task
  Components: fetcher
Affects Versions: 0.8
 Environment: n/a
Reporter: Sami Siren
 Assigned To: Andrzej Bialecki 
 Fix For: 0.9.0

 Attachments: patch.txt, patch2.txt, patch3.txt, patch4-fixed.txt, 
 patch4-trunk.txt



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-11-28 Thread Sami Siren (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12453975 ] 

Sami Siren commented on NUTCH-339:
--

Perhaps that exception is just a consequence of something else, like this:

2006-11-27 07:35:09,434 INFO  fetcher.Fetcher2 - -activeThreads=296, spinWaiting=204, fetchQueues.totalSize=0
2006-11-27 07:35:09,434 WARN  fetcher.Fetcher2 - Aborting with 296 hung threads.
2006-11-27 07:35:09,434 INFO  mapred.LocalJobRunner - 3821 pages, 207 errors, 5.5 pages/s, 780 kb/s,

and the next log entry is:

2006-11-27 07:35:15,443 INFO  mapred.JobClient -  map 100% reduce 0%




 Refactor nutch to allow fetcher improvements
 

 Key: NUTCH-339
 URL: http://issues.apache.org/jira/browse/NUTCH-339
 Project: Nutch
  Issue Type: Task
  Components: fetcher
Affects Versions: 0.8
 Environment: n/a
Reporter: Sami Siren
 Assigned To: Andrzej Bialecki 
 Fix For: 0.9.0

 Attachments: patch.txt, patch2.txt, patch3.txt, patch4-fixed.txt, 
 patch4-trunk.txt






[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-11-28 Thread Sami Siren (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12454045 ] 

Sami Siren commented on NUTCH-339:
--

I am running with 300 threads, in parsing mode.

The thread dump shows:

191 threads waiting on condition
at java.lang.Thread.sleep(Native Method)
at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:422)

71 waiting for monitor entry
at org.apache.nutch.fetcher.Fetcher2$FetchItemQueues.getFetchItem(Fetcher2.java:306)
- waiting to lock 0x52fa7328 (a org.apache.nutch.fetcher.Fetcher2$FetchItemQueues)
at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:415)

The rest are runnable.

CPU usage starts low but very quickly ramps up, and the machine becomes almost 
unresponsive.

Fetching speed is low because all the CPU goes to something else.


 Refactor nutch to allow fetcher improvements
 

 Key: NUTCH-339
 URL: http://issues.apache.org/jira/browse/NUTCH-339
 Project: Nutch
  Issue Type: Task
  Components: fetcher
Affects Versions: 0.8
 Environment: n/a
Reporter: Sami Siren
 Assigned To: Andrzej Bialecki 
 Fix For: 0.9.0

 Attachments: patch.txt, patch2.txt, patch3.txt, patch4-fixed.txt, 
 patch4-trunk.txt






[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-11-24 Thread Sami Siren (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12452522 ] 

Sami Siren commented on NUTCH-339:
--

The patch applies OK, but I get this error when I try to compile:

compile:
[echo] Compiling plugin: lib-http
[javac] Compiling 4 source files to /home/sam/tru/nutch/build/lib-http/classes
[javac] /home/sam/tru/nutch/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java:551: incompatible types
[javac] found   : org.apache.nutch.protocol.http.api.RobotRulesParser.RobotRuleSet
[javac] required: org.apache.nutch.protocol.RobotRules
[javac] return robots.getRobotRulesSet(this, url);
[javac]   ^
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 1 error



 Refactor nutch to allow fetcher improvements
 

 Key: NUTCH-339
 URL: http://issues.apache.org/jira/browse/NUTCH-339
 Project: Nutch
  Issue Type: Task
  Components: fetcher
Affects Versions: 0.8
 Environment: n/a
Reporter: Sami Siren
 Assigned To: Andrzej Bialecki 
 Fix For: 0.9.0

 Attachments: patch.txt, patch2.txt, patch3.txt, patch4-trunk.txt






Re: email to jira comments (WAS Re: [jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements)

2006-10-16 Thread Doug Cutting

Sami Siren wrote:
> Looks like somebody just enabled the email-to-jira-comments feature. I was 
> just wondering whether it would be good to use this feature more widely.

I think it would be good.  That way mailing list discussion would be 
logged to the bug as well.

> This could be achieved by removing the Reply-To header from messages 
> coming from JIRA so that replies get sent to [EMAIL PROTECTED] (I am 
> assuming that is possible). Then whenever somebody just hits reply in their 
> email client and writes a comment, it would automatically get attached to 
> the correct issue as a comment.

I sent a message to [EMAIL PROTECTED] this morning asking about this. 
If it's possible, and no one objects, I will request it for the Nutch 
mailing lists.


Doug


Re: email to jira comments (WAS Re: [jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements)

2006-10-16 Thread Piotr Kosiorowski

+1

On 10/16/06, Doug Cutting [EMAIL PROTECTED] wrote:
> Sami Siren wrote:
>> Looks like somebody just enabled the email-to-jira-comments feature. I was
>> just wondering whether it would be good to use this feature more widely.
>
> I think it would be good.  That way mailing list discussion would be
> logged to the bug as well.
>
>> This could be achieved by removing the Reply-To header from messages
>> coming from JIRA so that replies get sent to [EMAIL PROTECTED] (I am
>> assuming that is possible). Then whenever somebody just hits reply in their
>> email client and writes a comment, it would automatically get attached to
>> the correct issue as a comment.
>
> I sent a message to [EMAIL PROTECTED] this morning asking about this.
> If it's possible, and no one objects, I will request it for the Nutch
> mailing lists.
>
> Doug



[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-10-13 Thread Sami Siren (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12442195 ] 

Sami Siren commented on NUTCH-339:
--


   [[ Old comment, sent by email on Sun, 06 Aug 2006 08:06:13 +0300 ]]

The original Fetcher is no longer being polite?

Other than that, both seem to be working OK, based on a very
small crawl I did.

Some thoughts about the design (or perhaps more about how I did it :)

- The FetchQueue implementation could be in its own class (file).

- I also moved the class that handles robots parsing to core.

- I used the existing FibonacciHeap.java (in org.apache.nutch.util) to back 
the fetching queue. The priority I used was the time (in seconds) at which 
one can again fetch from that particular site. You can then use queue.peek 
to see the highest-priority site (the one that should be fetched next), 
check its time and, if needed, read more records from the recordreader.

- I created a new object, Site, that I queued. Those objects contained a list 
of URLs to be fetched from that site and the real time (in seconds) when one 
can fetch from it again.

- The queue hid the recordreader, so fetcher threads only had to deal 
with this queue.

- I didn't add any special method for robots rules to the Protocol interface; 
robots.txt is just like any other resource that's going to be fetched. 
Instead, when an HTTP URL was read from the recordreader for a site that had 
not been seen earlier, robots.txt was queued as a normal resource for that 
site (Site), to be fetched first. When that resource was fetched, it was 
advertised to the FetchingQueue, which then parsed it and stored it in the 
FetchSite object.

- Also, by using this FetchSite object I could easily implement some 
useful methods, like blocking all URLs from a site (for example when the 
hostname cannot be resolved, or connections constantly time out, etc.).

Attached you can find a simple drawing I did earlier of the new 
fetcher I had in mind - just for reference, in case my words are confusing :)
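
A minimal sketch of the queue design described above, not the code from any of the attached patches. It assumes java.util.PriorityQueue in place of Nutch's FibonacciHeap, and the names SiteQueueSketch, Site, add, next and block are hypothetical, chosen only for illustration. Each site carries its pending URLs and the time (in seconds) when it may be fetched again; robots.txt is queued as the first ordinary URL for a newly seen site.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical per-site fetch queue ordered by the time each site may next be fetched. */
final class SiteQueueSketch {

    /** One host, its pending URLs, and the earliest time (seconds) it may be fetched again. */
    static final class Site implements Comparable<Site> {
        final String host;
        final Deque<String> urls = new ArrayDeque<>();
        long nextFetchTime;   // seconds; 0 means "fetch immediately"
        boolean queued;       // true while this site sits in the heap
        boolean blocked;      // e.g. hostname cannot be resolved

        Site(String host) { this.host = host; }

        @Override
        public int compareTo(Site other) {
            return Long.compare(nextFetchTime, other.nextFetchTime);
        }
    }

    private final PriorityQueue<Site> ready = new PriorityQueue<>();
    private final Map<String, Site> sites = new HashMap<>();
    private final long delaySeconds;

    SiteQueueSketch(long delaySeconds) { this.delaySeconds = delaySeconds; }

    /** Adds a URL; a site seen for the first time gets robots.txt queued ahead of it. */
    synchronized void add(String host, String url) {
        Site site = sites.get(host);
        if (site == null) {
            site = new Site(host);
            site.urls.add("http://" + host + "/robots.txt");
            sites.put(host, site);
        }
        if (site.blocked) return;          // drop URLs for blocked sites
        site.urls.add(url);
        if (!site.queued) {
            ready.add(site);
            site.queued = true;
        }
    }

    /** Returns the next URL allowed at nowSeconds, or null if every site must still wait. */
    synchronized String next(long nowSeconds) {
        while (true) {
            Site head = ready.peek();      // site with the earliest allowed fetch time
            if (head == null || head.nextFetchTime > nowSeconds) return null;
            ready.poll();
            head.queued = false;
            if (head.blocked || head.urls.isEmpty()) continue;
            String url = head.urls.poll();
            // The priority is only changed while the site is outside the heap.
            head.nextFetchTime = nowSeconds + delaySeconds;
            if (!head.urls.isEmpty()) {
                ready.add(head);
                head.queued = true;
            }
            return url;
        }
    }

    /** Blocks a site and discards its pending URLs. */
    synchronized void block(String host) {
        Site site = sites.get(host);
        if (site != null) {
            site.blocked = true;
            site.urls.clear();
        }
    }
}
```

With a delay of 5 seconds, adding one page for a new host yields that host's robots.txt first, nothing while the delay is running, and the page itself once the delay has passed. Fetcher threads only ever call next, so the politeness logic stays in one place.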

--
  Sami Siren




[demime 1.01d removed an attachment of type image/png which had a name of 
fetcher.png]


 Refactor nutch to allow fetcher improvements
 

 Key: NUTCH-339
 URL: http://issues.apache.org/jira/browse/NUTCH-339
 Project: Nutch
  Issue Type: Task
  Components: fetcher
Affects Versions: 0.8
 Environment: n/a
Reporter: Sami Siren
 Assigned To: Sami Siren
 Fix For: 0.9.0

 Attachments: patch.txt, patch2.txt, patch3.txt






email to jira comments (WAS Re: [jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements)

2006-10-13 Thread Sami Siren

Sami Siren (JIRA) wrote:


   [[ Old comment, sent by email on Sun, 06 Aug 2006 08:06:13 +0300 ]]



Looks like somebody just enabled the email-to-jira-comments feature. I was 
just wondering whether it would be good to use this feature more widely.


This could be achieved by removing the Reply-To header from messages 
coming from JIRA so that replies get sent to [EMAIL PROTECTED] (I am 
assuming that is possible). Then whenever somebody just hits reply in their 
email client and writes a comment, it would automatically get attached to 
the correct issue as a comment.

--
 Sami Siren



[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-09-08 Thread JIRA
[ 
http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12433354 ] 

Doğacan Güney commented on NUTCH-339:
-

I have made a few changes to Andrzej's latest patch. The biggest change is that 
BLOCKED_ADDR_QUEUE is now a priority queue and cleanExpiredServerBlocks should 
block threads a lot less. I am attaching this as patch3.txt.
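
To illustrate why a priority queue can help here, the following is a rough sketch and not the code in patch3.txt. Only the names BLOCKED_ADDR_QUEUE and cleanExpiredServerBlocks come from the comment above; the class BlockedAddrSketch, its methods and its fields are hypothetical. Because the queue is ordered by expiry time, the cleanup can stop at the first entry that has not yet expired instead of scanning every blocked address while holding the lock.

```java
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

/** Hypothetical sketch: server blocks kept in a priority queue ordered by expiry time. */
final class BlockedAddrSketch {

    /** A blocked address together with the time (milliseconds) at which the block ends. */
    static final class Block implements Comparable<Block> {
        final String addr;
        final long expiresAt;

        Block(String addr, long expiresAt) {
            this.addr = addr;
            this.expiresAt = expiresAt;
        }

        @Override
        public int compareTo(Block other) {
            return Long.compare(expiresAt, other.expiresAt);
        }
    }

    private final PriorityQueue<Block> queue = new PriorityQueue<>();
    private final Set<String> blocked = new HashSet<>();

    /** Blocks addr until now + delay; returns false if it is already blocked. */
    synchronized boolean block(String addr, long now, long delay) {
        if (!blocked.add(addr)) return false;
        queue.add(new Block(addr, now + delay));
        return true;
    }

    /** Releases every block that has expired by now and returns how many were released. */
    synchronized int cleanExpired(long now) {
        int released = 0;
        // The head is always the block that expires soonest, so the loop
        // ends as soon as one unexpired entry is seen.
        while (!queue.isEmpty() && queue.peek().expiresAt <= now) {
            blocked.remove(queue.poll().addr);
            released++;
        }
        return released;
    }

    synchronized boolean isBlocked(String addr) {
        return blocked.contains(addr);
    }
}
```

In this sketch the time spent inside the lock during cleanup depends on how many blocks have actually expired, not on how many addresses are blocked in total.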

 Refactor nutch to allow fetcher improvements
 

 Key: NUTCH-339
 URL: http://issues.apache.org/jira/browse/NUTCH-339
 Project: Nutch
  Issue Type: Task
  Components: fetcher
Affects Versions: 0.8
 Environment: n/a
Reporter: Sami Siren
 Assigned To: Sami Siren
 Fix For: 0.9.0

 Attachments: patch.txt, patch2.txt, patch3.txt






[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-09-07 Thread Sami Siren (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12433185 ] 

Sami Siren commented on NUTCH-339:
--

Andrzej,

are you still working on this, or should I proceed as I originally planned? ;)

 Refactor nutch to allow fetcher improvements
 

 Key: NUTCH-339
 URL: http://issues.apache.org/jira/browse/NUTCH-339
 Project: Nutch
  Issue Type: Task
  Components: fetcher
Affects Versions: 0.8
 Environment: n/a
Reporter: Sami Siren
 Assigned To: Sami Siren
 Fix For: 0.9.0

 Attachments: patch.txt, patch2.txt






[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-08-04 Thread Andrzej Bialecki (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12425763 ] 

Andrzej Bialecki  commented on NUTCH-339:
-

Great minds think alike ... ;) I started doing exactly this, and so far my 
patches seem to follow all requirements.

Here's my work-in-progress patch. Warning: not tested!

 Refactor nutch to allow fetcher improvements
 

 Key: NUTCH-339
 URL: http://issues.apache.org/jira/browse/NUTCH-339
 Project: Nutch
  Issue Type: Task
  Components: fetcher
Affects Versions: 0.9
 Environment: n/a
Reporter: Sami Siren
 Assigned To: Sami Siren





[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-08-04 Thread Sami Siren (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12425782 ] 

Sami Siren commented on NUTCH-339:
--

I am not sure what you are referring to with this 3-4 sec, but yes, I agree 
there are more aspects to optimize in the fetcher. What I was first concerned 
about was the fetching IO speed, which was getting ridiculously low (not quite 
sure when this happened).

We should open more than one ticket to track these separate aspects. And for 
general discussion, the mailing lists are perhaps the best place.




 Refactor nutch to allow fetcher improvements
 

 Key: NUTCH-339
 URL: http://issues.apache.org/jira/browse/NUTCH-339
 Project: Nutch
  Issue Type: Task
  Components: fetcher
Affects Versions: 0.9
 Environment: n/a
Reporter: Sami Siren
 Assigned To: Sami Siren
 Attachments: patch.txt

