Hi, I had a little problem and noticed that someone else ran into it
too, so I tried to solve it in my own way...

First let me explain the problem: I'm developing an application in
which the user can classify web resources inside a taxonomy.
The user chooses a seed url and defines the crawling depth, then the
system fetches required web pages and indexes them.
Nutch seems the best solution for this task, but it has a small
drawback: when the crawling depth is reached, the webdb still
contains all fetched urls along with a set of unfetched ones (the
outgoing links discovered in the last iteration). So when the user adds
another seed url, the generate command will create a segment containing
the new url plus all the urls in DB_UNFETCHED status.

I need to be able to tell nutch when it shouldn't retrieve outgoing links...

Thus I made some little modifications, but since I am new to nutch I
can't yet understand how things work (especially with the hadoop
abstraction layer), and I'm sure there's a better way to approach the
problem.
Anyway, I'll explain my solution, just as a starting point for a
coherent approach :-)

First, I defined a configuration file, disable-outlinks.xml:

<configuration>
        <property>
                <name>crawl.fetch.outlinks</name>
                <value>false</value>
        </property>
</configuration>

Then I patched HtmlParser.java:
   1. I added a 'private boolean fetchOutLinks' property
   2. In the setConf method, I added this instruction:
this.fetchOutLinks = getConf().getBoolean("crawl.fetch.outlinks", true);
   3. In the getParse() method I replaced this line:
          if (!metaTags.getNoFollow()) {
       with this one:
          if (!metaTags.getNoFollow() && fetchOutLinks) {
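
Taken together, the HtmlParser changes look roughly like this (just a
sketch of the patched spots, not a complete class; the surrounding
members and exact method bodies depend on your Nutch version):

// org.apache.nutch.parse.html.HtmlParser (sketch of the patch)

private boolean fetchOutLinks;

public void setConf(Configuration conf) {
  this.conf = conf;
  // ... existing setConf code ...
  // default to true so ordinary crawls keep extracting outlinks
  this.fetchOutLinks = getConf().getBoolean("crawl.fetch.outlinks", true);
}

// inside getParse(), guarding the outlink extraction:
if (!metaTags.getNoFollow() && fetchOutLinks) {
  // ... existing outlink-collection code unchanged ...
}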

I then changed Crawl.java main() method: inside the second for() loop
I added this instruction:
   if( i == (depth - 1) ) job.addDefaultResource("disable-outlinks.xml");
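
For context, that line sits in the generate/fetch/update loop of
Crawl.main(); the placement is roughly this (a sketch - the variable
names follow my copy of Crawl.java, and disable-outlinks.xml must be on
the classpath for addDefaultResource to find it):

// sketch of the fetch loop in Crawl.main()
for (int i = 0; i < depth; i++) {
  // on the final iteration, tell the parser not to emit outlinks,
  // so the webdb ends up with no DB_UNFETCHED leftovers
  if (i == (depth - 1)) {
    job.addDefaultResource("disable-outlinks.xml");
  }
  // ... generate, fetch, parse, updatedb as before ...
}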

That's it, and it works for me... I know it's a terrible hack, but
please be patient ;-)

Thanks for your attention and for the great job!

Enrico


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers