I want to use Nutch to crawl a filesystem, and if the content of the
filesystem has changed since the last crawl, it should be fetched again.
I studied the code for the Adaptive Re-Fetch cycle, but the patch is out of
date now that Nutch has implemented other features. Also, I don't want to
change anything in the core code, so that I can easily migrate to newer
versions. I want to develop the feature as a plugin similar to the
Protocol-File plugin.

 

I have been digging in the source code for the Protocol-File plugin and
therefore have a few questions:

 

My Nutch Revision is: 475201 from the subversion server.

 

In the class File.java (Protocol-File plugin), the getProtocolOutput method
has a condition as follows:

 

<Line 62>

    else if ((code >= 300 && code < 400) && code != 304) {     // handle redirect
        if (redirects == MAX_REDIRECTS)
            throw new FileException("Too many redirects: " + url);
        u = new URL(response.getHeader("Location"));
        redirects++;
        if (LOG.isTraceEnabled()) {
            LOG.trace("redirect to " + u);
        }

 

In my case, if the file has not been modified, the code will be 304 (NOT
MODIFIED). I want to know the effect of this line on the CrawlDB. The file
should not be removed or marked as gone (CrawlDatum.STATUS_FETCH_GONE). If
that is already the case, does that mean I don't have to write a plugin to
handle the checking of unmodified content? If not, could you tell me how the
Protocol-File plugin checks for unmodified content, since it says it mimics
an HTTP response?
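
To make the question concrete, here is a minimal sketch of the kind of check I have in mind. It is not taken from the Nutch source: the class ModifiedCheckSketch, the methods statusFor and isRedirect, and the lastFetchTime parameter are all hypothetical names, and I am assuming the file's last-modified timestamp is a good enough signal that the content changed.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical sketch, not Nutch code: maps a local file to an HTTP-like
// status code, the way a file protocol plugin might mimic an HTTP response.
public class ModifiedCheckSketch {
    static final int OK = 200;
    static final int NOT_MODIFIED = 304;
    static final int NOT_FOUND = 404;

    // lastFetchTime is the time of the previous fetch in epoch milliseconds,
    // or 0 if the file has never been fetched (an assumption of this sketch).
    static int statusFor(File file, long lastFetchTime) {
        if (!file.exists()) {
            return NOT_FOUND;
        }
        if (lastFetchTime > 0 && file.lastModified() <= lastFetchTime) {
            return NOT_MODIFIED; // unchanged since the last fetch
        }
        return OK; // new or changed, so fetch the content again
    }

    // Same test as the quoted condition: 3xx codes other than 304 are redirects.
    static boolean isRedirect(int code) {
        return (code >= 300 && code < 400) && code != 304;
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("sketch", ".txt");
        f.deleteOnExit();
        long modified = f.lastModified();
        System.out.println(statusFor(f, 0L));              // never fetched
        System.out.println(statusFor(f, modified + 1000)); // fetched after last change
        System.out.println(statusFor(f, modified - 1000)); // changed after last fetch
        System.out.println(isRedirect(304));               // 304 is excluded from redirects
    }
}
```

With a check like this, a 304 would skip the redirect branch quoted above, which is why I am asking what then happens to the entry in the CrawlDB.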

 

Armel
