[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12453798 ]

Sami Siren commented on NUTCH-339:
----------------------------------

When running a test fetch with Fetcher2 I encountered this error after fetching a few thousand pages (from a segment of 1 million):

Exception in thread "QueueFeeder" java.lang.NullPointerException
        at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:244)
        at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:296)
        at org.apache.hadoop.io.SequenceFile$Reader.getPosition(SequenceFile.java:1433)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.getPos(SequenceFileRecordReader.java:97)
        at org.apache.hadoop.mapred.MapTask$3.next(MapTask.java:196)
        at org.apache.nutch.fetcher.Fetcher2$QueueFeeder.run(Fetcher2.java:364)

> Refactor nutch to allow fetcher improvements
> --------------------------------------------
>
>                 Key: NUTCH-339
>                 URL: http://issues.apache.org/jira/browse/NUTCH-339
>             Project: Nutch
>          Issue Type: Task
>          Components: fetcher
>    Affects Versions: 0.8
>         Environment: n/a
>            Reporter: Sami Siren
>         Assigned To: Andrzej Bialecki
>             Fix For: 0.9.0
>
>         Attachments: patch.txt, patch2.txt, patch3.txt, patch4-fixed.txt, patch4-trunk.txt
>
>
> As I (and Stefan?) see it, there are two major areas where the current fetcher
> could be improved (as in speed):
> 1. Politeness code and how it is implemented is the biggest
>    problem of the current fetcher (together with robots.txt handling).
>    Simple code changes, like replacing it with a PriorityQueue
>    based solution, showed very promising results in increased IO.
> 2. Changing the fetcher to use non-blocking IO (this requires a great amount
>    of work, as we need to implement the protocols from scratch again).
> I would like to start working towards #1 by first refactoring
> the current code (plugins, actually) in the following way:
> 1. Move robots.txt handling away from the (lib-http) plugin.
>    Even if this is related only to http, leaving it in lib-http
>    does not allow other kinds of scheduling strategies to be implemented
>    (it is hardcoded to fetch robots.txt from the same thread when requesting
>    a page from a site from which it hasn't yet tried to load robots.txt).
> 2. Move the politeness code away from the (lib-http) plugin.
>    It is really usable outside http, and the current design also limits
>    changing the implementation (to a queue-based one).
> Where to move these? My suggestion is the nutch core; does anybody
> see problems with this?
> These refactoring activities are to be done in such a way that none
> of the current functionality is (at least deliberately) changed, leaving
> room and the possibility to build the next generation fetcher(s)
> without destroying the old one at the same time.
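
For readers unfamiliar with the idea in point 1 of the quoted description, the following is a minimal sketch of what a PriorityQueue-based politeness scheduler could look like. It is not the code from the attached patches or from Fetcher2; the class name PoliteScheduler, its methods, and the fixed per-host delay are all assumptions made for illustration. Each host gets its own URL queue, and hosts are ordered by the earliest time they may be contacted again.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical sketch only: per-host queues ordered by earliest allowed fetch time. */
class PoliteScheduler {
    /** Pending URLs for one host plus the earliest time it may be contacted again. */
    private static final class HostQueue {
        final Deque<String> urls = new ArrayDeque<>();
        long nextFetchTime; // milliseconds; 0 means "may be fetched immediately"
    }

    private final long delayMillis; // assumed fixed politeness delay per host
    private final Map<String, HostQueue> byHost = new HashMap<>();
    // Invariant: a HostQueue is in 'ready' exactly when it has pending URLs.
    private final PriorityQueue<HostQueue> ready =
        new PriorityQueue<>((a, b) -> Long.compare(a.nextFetchTime, b.nextFetchTime));

    PoliteScheduler(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /** Queues a URL under its host. */
    void add(String host, String url) {
        HostQueue q = byHost.computeIfAbsent(host, h -> new HostQueue());
        boolean wasEmpty = q.urls.isEmpty();
        q.urls.add(url);
        if (wasEmpty) {
            ready.add(q); // host (re)enters the priority queue, keeping its last delay
        }
    }

    /** Returns a URL that may be fetched at time 'now', or null if every host is still waiting. */
    String next(long now) {
        HostQueue q = ready.peek();
        if (q == null || q.nextFetchTime > now) {
            return null;
        }
        ready.poll();
        String url = q.urls.poll();
        q.nextFetchTime = now + delayMillis; // updated while outside the heap, so ordering stays valid
        if (!q.urls.isEmpty()) {
            ready.add(q);
        }
        return url;
    }
}
```

The current time is passed in as a parameter rather than read from the system clock, which keeps the sketch deterministic; a fetcher thread would call next(System.currentTimeMillis()) and sleep or try again when it returns null.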