[ https://issues.apache.org/jira/browse/NUTCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Susam Pal updated NUTCH-601:
----------------------------

    Attachment: NUTCH-601v1.0.patch

Attached another patch (NUTCH-601v1.0.patch) that always deletes the old merged index, as Andrzej suggested.

The v0.4 patch leaves the old merged index alongside the new segments if something goes wrong while the new index is being generated. Whether the index merger fails or succeeds, there is always an 'index' directory. So, after a recrawl completes, a user may want to verify whether the 'index' directory holds the new merged index or the old one. This may be confusing. However, one advantage is that a recrawl can be run on the same crawl directory that the web GUI is using to serve users. This patch minimizes the time for which the index directory is unavailable.

The v1.0 patch always deletes the old indexes as well as the old merged index. Therefore, the old index never remains once index generation has begun. If the index merger fails, there is no 'index' directory, which is a clear indication that index generation failed. This prevents the confusion discussed above.

Please review both patches and accept whichever the community feels is better.

> Recrawling on existing crawl directory using force option
> ---------------------------------------------------------
>
>                 Key: NUTCH-601
>                 URL: https://issues.apache.org/jira/browse/NUTCH-601
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>            Reporter: Susam Pal
>            Priority: Minor
>         Attachments: NUTCH-601v0.1.patch, NUTCH-601v0.2.patch, NUTCH-601v0.3.patch, NUTCH-601v1.0.patch
>
>
> Added a '-force' option to the 'bin/nutch crawl' command line.
> With this option, one can crawl and recrawl in the following manner:
> {code}
> bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5
> bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
> {code}
> This option can be used for the first crawl too:
> {code}
> bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
> bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
> {code}
> If one tries to crawl without the -force option when the crawl directory already exists, the error message includes a hint:
> {code}
> # bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5
> Exception in thread "main" java.lang.RuntimeException: crawl already exists. Add -force option to recrawl.
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:89)
> {code}
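
The delete-first behaviour that the comment attributes to the v1.0 patch can be sketched roughly as below. This is an illustration only and not code from either patch: rebuild_index and index_was_rebuilt are hypothetical helper names, and the command passed to rebuild_index is a stand-in for whatever regenerates the merged index. The cleanup of a partial 'index' directory after a failed merge is an assumption of this sketch.

```shell
#!/bin/sh
# Illustrative sketch only, NOT taken from the NUTCH-601 patches.
# rebuild_index and index_was_rebuilt are hypothetical helper names.

# Usage: rebuild_index CRAWL_DIR COMMAND [ARGS...]
# COMMAND is a stand-in for whatever regenerates CRAWL_DIR/index.
rebuild_index() {
    crawl_dir=$1
    shift
    # Delete the old merged index first, so a stale copy can never
    # be mistaken for a freshly generated one.
    rm -rf "$crawl_dir/index"
    # Run the stand-in merge command.
    if "$@"; then
        return 0
    fi
    # Assumption of this sketch: remove any partial output so that a
    # missing 'index' directory always signals a failed merge.
    rm -rf "$crawl_dir/index"
    return 1
}

# Succeeds only if CRAWL_DIR/index exists after a recrawl.
index_was_rebuilt() {
    [ -d "$1/index" ]
}
```

With this approach, checking for the 'index' directory after a recrawl is enough to tell success from failure. The trade-off, as noted in the comment, is that the directory is unavailable for the whole time the index is being regenerated.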