Nutch 2.3.1 re-crawls unchanged web pages

2016-11-24 Thread Vladimir Loubenski
Hi, I am using Nutch 2.3.1. I run the generate, fetch, parse, and updateDB steps in a loop. I noticed that during a re-crawl, even if a web page has not changed, Nutch does not detect this from the ETag, Last-Modified, or signature fields and continues to process all these steps for unchanged web pages. Is it ex

[jira] [Commented] (NUTCH-2334) Extension point for schedulers

2016-11-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15693599#comment-15693599 ] Sebastian Nagel commented on NUTCH-2334: Hi [~roannel], what does "extension point

Crawler-Commons 0.7 released

2016-11-24 Thread Julien Nioche
Apologies for cross-posting. The Crawler-Commons project is pleased to announce its 0.7 release: https://github.com/crawler-commons/crawler-commons#24th-november-2016crawler-commons-07-released The list of changes can be found here