Generate should mark selected records in crawlDB
------------------------------------------------

                 Key: NUTCH-415
                 URL: http://issues.apache.org/jira/browse/NUTCH-415
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 0.8.1, 0.8, 0.8.2, 0.9.0
            Reporter: Andrzej Bialecki 
         Assigned To: Andrzej Bialecki 
             Fix For: 0.8.2, 0.9.0


In Nutch 0.7.x, if a user ran "generate" twice without an intervening 
"updatedb", each fetchlist would be different, because "generate" would mark 
the selected entries as "being fetched" (by moving their fetch time one week 
forward).
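
A minimal sketch of that marking behaviour, for illustration only. It assumes 
a simplified crawldb modelled as a map from URL to next fetch time; the class 
name GenerateMarkSketch, the generate() helper and the topN parameter are 
hypothetical and are not Nutch APIs. It also assumes that "one week forward" 
is counted from the time of the generate run.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model: a crawldb reduced to URL -> next fetch time (ms).
public class GenerateMarkSketch {
    static final long ONE_WEEK_MS = 7L * 24 * 60 * 60 * 1000;

    // Select up to topN entries that are due for fetching and mark each selected
    // entry as "being fetched" by moving its fetch time one week past 'now'.
    public static List<String> generate(Map<String, Long> crawlDb, long now, int topN) {
        List<String> fetchlist = new ArrayList<>();
        for (Map.Entry<String, Long> entry : crawlDb.entrySet()) {
            if (fetchlist.size() >= topN) {
                break;
            }
            if (entry.getValue() <= now) {
                fetchlist.add(entry.getKey());
                // Non-structural update, so it is safe while iterating.
                entry.setValue(now + ONE_WEEK_MS);
            }
        }
        return fetchlist;
    }

    public static void main(String[] args) {
        Map<String, Long> crawlDb = new LinkedHashMap<>();
        crawlDb.put("http://a.example/", 0L);
        crawlDb.put("http://b.example/", 0L);
        crawlDb.put("http://c.example/", 0L);
        long now = 1000L;
        // No "updatedb" between the two runs, yet the fetchlists differ,
        // because the first run marked the entries it selected.
        System.out.println(generate(crawlDb, now, 2));
        System.out.println(generate(crawlDb, now, 2));
    }
}
```

Without the setValue() call, both runs in this sketch would select the same 
two URLs.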

In Nutch 0.8 and later, crawldb is not modified at all during "generate". This 
means that two "generate" runs without an intervening "updatedb" will create 
exactly the same fetchlists, which is undesirable.

I propose to re-implement this feature, using the same mechanism. The CrawlDB 
update would be performed simultaneously with the first mapred job in 
Generator: Selector would produce the modified crawldb content together with 
an (unsorted) fetchlist, using a custom OutputFormat (patches to follow ;) ). 
Additionally, to ensure that the correct version of the modified crawldb is 
installed, I propose to add a locking mechanism that prevents two processes 
that modify crawldb from running simultaneously.
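
One possible shape for such a lock, sketched under stated assumptions: it uses 
a lock file on the local filesystem via java.nio.file as a stand-in for 
whatever filesystem crawldb actually lives on, and the names CrawlDbLockSketch, 
acquire(), release() and the ".locked" file name are hypothetical. This is not 
the patch itself, only an illustration of the idea.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: a lock file inside the crawldb directory guards writers.
public class CrawlDbLockSketch {
    // Assumed lock file name, chosen for illustration only.
    static final String LOCK_NAME = ".locked";

    // Try to take the lock; fail if another process already holds it.
    public static Path acquire(Path crawlDbDir) throws IOException {
        Path lock = crawlDbDir.resolve(LOCK_NAME);
        try {
            // Files.createFile checks for existence and creates the file atomically.
            return Files.createFile(lock);
        } catch (FileAlreadyExistsException e) {
            throw new IOException("crawldb is locked by another process: " + lock, e);
        }
    }

    // Release the lock once the modified crawldb has been installed.
    public static void release(Path lock) throws IOException {
        Files.deleteIfExists(lock);
    }

    public static void main(String[] args) throws IOException {
        Path crawlDbDir = Files.createTempDirectory("crawldb");
        Path lock = acquire(crawlDbDir);
        try {
            // A second writer arriving while the lock is held is turned away.
            try {
                acquire(crawlDbDir);
                System.out.println("unexpected: second writer got the lock");
            } catch (IOException expected) {
                System.out.println("second writer rejected");
            }
        } finally {
            release(lock);
        }
    }
}
```

A process that modifies crawldb would hold the lock for the whole 
run, and release it in a finally block so that a failed job does not leave 
crawldb locked. A crashed process can still leave a stale lock file behind, so 
a real implementation would need some way to override it.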


        
