[jira] [Created] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script

Jorge Luis Betancourt Gonzalez (JIRA) Thu, 04 Jun 2015 05:52:11 -0700

Jorge Luis Betancourt Gonzalez created NUTCH-2036:
-----------------------------------------------------


             Summary: Adding some continuous crawl goodies to the crawl script
                 Key: NUTCH-2036
                 URL: https://issues.apache.org/jira/browse/NUTCH-2036
             Project: Nutch
          Issue Type: Improvement
          Components: bin, tool, util
    Affects Versions: 1.10, 1.11
            Reporter: Jorge Luis Betancourt Gonzalez
            Priority: Minor


Nutch does not support continuous crawling out of the box. Although this can 
be approximated with cron, and is sometimes irrelevant given the size of the 
crawl, it is a nice feature to have. 

This patch adds a new option to the {{bin/crawl}} script (-w|--wait) that sets 
how long to wait when the generator returns 0 (that is, when no URLs are 
scheduled for fetching). 

This new parameter has the {{NUMBER\[SUFFIX\]}} format. If no suffix is 
provided, the amount of time is assumed to be in seconds. Valid suffixes 
are: 

s - seconds
m - minutes
h - hours
d - days

If a {{-1}} value is passed to the parameter, or the parameter is not used at 
all, the default behaviour of exiting the script is used.
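
As an illustration of the format described above, here is a minimal sketch of 
how a {{NUMBER\[SUFFIX\]}} value could be converted to seconds in shell. It is 
not taken from the patch: the function name {{wait_to_seconds}} and its error 
message are hypothetical, and treating a missing value the same as {{-1}} is 
an assumption based on the description.

```shell
# Hypothetical helper, not part of the actual patch.
# Converts NUMBER[SUFFIX] (suffix: s, m, h or d) into seconds.
# Prints -1 when waiting is disabled (value is -1 or missing).
# Returns non-zero and prints nothing on stdout for malformed input.
wait_to_seconds() {
  local value number factor
  value="${1:--1}"

  # -1 or no value: keep the default behaviour (no waiting).
  if [ "$value" = "-1" ]; then
    echo -1
    return 0
  fi

  # Split the value into its numeric part and a multiplier.
  case "$value" in
    *s) factor=1;     number="${value%s}" ;;
    *m) factor=60;    number="${value%m}" ;;
    *h) factor=3600;  number="${value%h}" ;;
    *d) factor=86400; number="${value%d}" ;;
    *)  factor=1;     number="$value" ;;   # no suffix: seconds
  esac

  # The numeric part must be a non-empty string of digits.
  case "$number" in
    ""|*[!0-9]*)
      echo "invalid wait value: $value" >&2
      return 1
      ;;
  esac

  # Strip leading zeros so the shell does not read the number as octal.
  while [ "${#number}" -gt 1 ] && [ "${number#0}" != "$number" ]; do
    number="${number#0}"
  done

  echo $(( number * factor ))
}
```

With this sketch, {{wait_to_seconds 5m}} prints 300 and {{wait_to_seconds 30}} 
prints 30. A script could sleep for the printed number of seconds before 
retrying the generate step, and exit as before when the result is -1.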



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
