[jira] [Commented] (NUTCH-1643) Unnecessary fetching with http.content.limit when using protocol-http
[ https://issues.apache.org/jira/browse/NUTCH-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808006#comment-13808006 ]

Talat UYARER commented on NUTCH-1643:
-------------------------------------

Hi [~lewismc], I can look at every protocol for this improvement. But we should consider your second item. I looked at ParseUtil.java: when the parsers return null, it writes a warning log. I think this is not a problem. What do you think? I will upload patches for the other protocols. I added the parse method from ParseUtil.java:

{code:title=ParseUtil.java|borderStyle=solid}
public Parse parse(String url, WebPage page) throws ParserNotFound, ParseException {
  Parser[] parsers = null;
  String contentType = TableUtil.toString(page.getContentType());
  parsers = this.parserFactory.getParsers(contentType, url);
  for (int i = 0; i < parsers.length; i++) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Parsing [" + url + "] with [" + parsers[i] + "]");
    }
    Parse parse = null;
    if (maxParseTime != -1)
      parse = runParser(parsers[i], url, page);
    else
      parse = parsers[i].getParse(url, page);
    if (parse != null && ParseStatusUtils.isSuccess(parse.getParseStatus())) {
      return parse;
    }
  }
  LOG.warn("Unable to successfully parse content " + url + " of type " + contentType);
  return ParseStatusUtils.getEmptyParse(new ParseException("Unable to successfully parse content"), null);
}
{code}

Unnecessary fetching with http.content.limit when using protocol-http
---------------------------------------------------------------------

Key: NUTCH-1643
URL: https://issues.apache.org/jira/browse/NUTCH-1643
Project: Nutch
Issue Type: Bug
Components: protocol
Affects Versions: 2.1, 2.2, 2.2.1
Reporter: Talat UYARER
Priority: Minor
Fix For: 2.3
Attachments: NUTCH-1643.patch, NUTCH-1643v2.patch

In protocol-http, even if I have a http.content.limit value set, protocol-http fetches files of all sizes (larger files are fetched as far as the limit allows). But when parsing, the parser skips incomplete files (if the parser.skip.truncated configuration is true). It seems like unnecessary effort to partially fetch contents larger than the limit if they are not going to be parsed.

-- This message was sent by Atlassian JIRA (v6.1#6144)
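The improvement described in the issue could be sketched roughly as follows: check the Content-Length response header before downloading the body, and skip the download entirely when the document would only be truncated and then discarded by the parser. This is an illustrative sketch under stated assumptions, not Nutch's actual protocol-http code; the class, method, and parameter names are hypothetical.

```java
// Hypothetical helper: decide whether downloading the body is pointless.
// Mirrors the semantics of http.content.limit (-1 means "no limit") and
// parser.skip.truncated as described in the issue; not part of Nutch's API.
public class ContentLimitCheck {

    /**
     * @param contentLength value of the Content-Length header, in bytes
     * @param contentLimit  http.content.limit in bytes, or -1 for no limit
     * @param skipTruncated whether parser.skip.truncated is enabled
     */
    static boolean shouldSkipBody(long contentLength, int contentLimit,
                                  boolean skipTruncated) {
        // with no limit, or when truncated docs are still parsed, fetch as usual
        if (contentLimit < 0 || !skipTruncated) {
            return false;
        }
        // the fetch would stop at the limit and the parser would drop the result
        return contentLength > contentLimit;
    }

    public static void main(String[] args) {
        // 10 MB document against a 64 kB limit, truncated docs discarded
        System.out.println(shouldSkipBody(10_000_000L, 65536, true));  // true
        // same document with no limit configured
        System.out.println(shouldSkipBody(10_000_000L, -1, true));     // false
    }
}
```

One caveat: responses using chunked transfer encoding carry no Content-Length, so a check like this can only be an optimization for servers that declare the size up front.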
[jira] [Commented] (NUTCH-1651) modifiedTime and prevmodifiedTime never set
[ https://issues.apache.org/jira/browse/NUTCH-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808045#comment-13808045 ]

lufeng commented on NUTCH-1651:
-------------------------------

Hi Talat, but I think getting the last modified time from the headers is not appropriate here. A user may want to check the modification of an HTML page in a parser plugin through the content of that URL, not the metadata in the HTML headers, even if the value of Last-Modified in the headers has changed.

{code:java}
+Utf8 lastModified = page.getFromHeaders(new Utf8("Last-Modified"));
+if (lastModified != null) {
+  try {
+    modifiedTime = HttpDateFormat.toLong(lastModified.toString());
+    prevModifiedTime = page.getModifiedTime();
+  } catch (Exception e) {
+  }
+}
{code}

Maybe the appropriate way is to let the parser plugin defined by the user set the value of the modified time, rather than the DbUpdateReducer class.

modifiedTime and prevmodifiedTime never set
-------------------------------------------

Key: NUTCH-1651
URL: https://issues.apache.org/jira/browse/NUTCH-1651
Project: Nutch
Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Talat UYARER
Fix For: 2.3
Attachments: NUTCH-1651.patch

modifiedTime is never set. If you use DefaultFetchSchedule, modifiedTime is always zero by default. If you use AdaptiveFetchSchedule, modifiedTime is set only once in the beginning by the zero check of AdaptiveFetchSchedule. But this is not sufficient, since modifiedTime needs to be updated whenever a last modified time is available. We corrected this with a patch. We also noticed that prevModifiedTime is not written to the database, and we corrected that too. With this patch, whenever lastModifiedTime is available, we do two things: first we set the modifiedTime in the Page object as prevModifiedTime, then we set lastModifiedTime as modifiedTime.

-- This message was sent by Atlassian JIRA (v6.1#6144)
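The update order described in the patch ("first we set modifiedTime to prevModifiedTime, then we set lastModifiedTime to modifiedTime") can be sketched as below. The Page class here is a tiny stand-in for Nutch's generated WebPage, not the real Gora-backed API; the helper name is hypothetical.

```java
// Minimal sketch of the two-step update: preserve the page's current
// modifiedTime as prevModifiedTime before overwriting it with the newly
// observed Last-Modified value. Not Nutch's actual WebPage class.
public class ModifiedTimeUpdate {

    static final class Page {
        long modifiedTime;
        long prevModifiedTime;
    }

    /** Apply a newly observed last-modified timestamp (epoch millis). */
    static void applyLastModified(Page page, long lastModifiedTime) {
        // order matters: save the old value first, then store the new one
        page.prevModifiedTime = page.modifiedTime;
        page.modifiedTime = lastModifiedTime;
    }

    public static void main(String[] args) {
        Page p = new Page();
        applyLastModified(p, 1000L);
        applyLastModified(p, 2000L);
        // prevModifiedTime now holds the earlier value, modifiedTime the latest
        System.out.println(p.prevModifiedTime + " " + p.modifiedTime); // 1000 2000
    }
}
```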
[jira] [Commented] (NUTCH-1564) AdaptiveFetchSchedule: sync_delta forces immediate refetch for documents not modified
[ https://issues.apache.org/jira/browse/NUTCH-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808122#comment-13808122 ]

Talat UYARER commented on NUTCH-1564:
-------------------------------------

[~amuseme.lu] How do you check this problem? Do you use the main method in AdaptiveFetchSchedule for checking? If you use that, it has some issues. [~icebergx5] fixed it in TestAdaptiveFetchSchedule.java; you can check with that.

AdaptiveFetchSchedule: sync_delta forces immediate refetch for documents not modified
-------------------------------------------------------------------------------------

Key: NUTCH-1564
URL: https://issues.apache.org/jira/browse/NUTCH-1564
Project: Nutch
Issue Type: Bug
Components: crawldb
Affects Versions: 1.6, 2.1
Reporter: Sebastian Nagel
Priority: Critical

In a continuous crawl with adaptive fetch scheduling, documents not modified for a longer time may be fetched in every cycle. A continuous crawl is run daily with 3 cycles and the following scheduling intervals (freshness matters):
{code}
db.fetch.schedule.class = org.apache.nutch.crawl.AdaptiveFetchSchedule
db.fetch.schedule.adaptive.sync_delta = true (default)
db.fetch.schedule.adaptive.sync_delta_rate = 0.3 (default)
db.fetch.interval.default = 172800 (2 days)
db.fetch.schedule.adaptive.min_interval = 86400 (1 day)
db.fetch.schedule.adaptive.max_interval = 604800 (7 days)
db.fetch.interval.max = 604800 (7 days)
{code}
At Apr 18 a URL is generated and fetched (from segment dump):
{code}
Crawl Generate::
  Status: 2 (db_fetched)
  Fetch time: Mon Apr 15 19:43:22 CEST 2013
  Modified time: Tue Mar 19 01:07:42 CET 2013
  Retries since fetch: 0
  Retry interval: 604800 seconds (7 days)
Crawl Fetch::
  Status: 33 (fetch_success)
  Fetch time: Thu Apr 18 01:23:51 CEST 2013
  Modified time: Tue Mar 19 01:07:42 CET 2013
  Retries since fetch: 0
  Retry interval: 604800 seconds (7 days)
{code}
Running CrawlDb update results in a next fetch time in the past (which forces an immediate refetch in the next cycle):
{code}
Status: 6 (db_notmodified)
Fetch time: Tue Apr 16 01:37:00 CEST 2013
Modified time: Tue Mar 19 01:07:42 CET 2013
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
{code}
This behavior is caused by the sync_delta calculation in AdaptiveFetchSchedule:
{code}
if (SYNC_DELTA) {
  // try to synchronize with the time of change
  long delta = (fetchTime - modifiedTime) / 1000L;
  if (delta > interval) interval = delta;
  refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
}
if (interval < MIN_INTERVAL) {
  interval = MIN_INTERVAL;
} else if (interval > MAX_INTERVAL) {
  interval = MAX_INTERVAL;
}
...
datum.setFetchTime(refTime + Math.round(interval * 1000.0));
{code}
{{delta}} is 30 days (Apr 18 - Mar 19). {{refTime}} is then 9 days in the past ({{delta}} * 0.3). After adding {{interval}} (adjusted to {{MAX_INTERVAL}} = 7 days) to {{refTime}}, the next fetch would take place 2 days in the past (Apr 16). According to the [javadoc|http://nutch.apache.org/apidocs-2.1/org/apache/nutch/crawl/AdaptiveFetchSchedule.html] (if understood right), there are two aims of the sync_delta if we know that a document hasn't been modified for long:
* increase the fetch interval immediately (not step by step)
* because we expected the document to change within the adaptive interval (but it hasn't), we shift the reference time, i.e. we expect a change soon.
These two aims are somewhat in contradiction. In any case, the next fetch time should always be within the range of (currentFetchTime + MIN_INTERVAL) and (currentFetchTime + MAX_INTERVAL), and never in the past. This problem has been noted by [~pascaldimassimo] in [1|http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/] and [2|http://lucene.472066.n3.nabble.com/Adaptive-sync-with-the-time-of-page-change-td870842.html#a897234].

-- This message was sent by Atlassian JIRA (v6.1#6144)
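The arithmetic in the report can be checked with a small standalone walk-through of the quoted sync_delta logic, plugging in the reported values (30 days since modification, sync_delta_rate = 0.3, max_interval = 7 days). This is a sketch for verification only; the class and method are not part of Nutch.

```java
// Numeric walk-through of the sync_delta calculation quoted in the issue.
// With the reported settings it reproduces the bug: the next fetch time
// lands 2 days before the current fetch time.
public class SyncDeltaDemo {
    static final long DAY = 86400L; // seconds per day

    /** Returns (nextFetchTime - fetchTime) in days, following the quoted logic. */
    static double nextFetchOffsetDays(long daysSinceModified, long intervalSec,
                                      double syncDeltaRate,
                                      long minIntervalSec, long maxIntervalSec) {
        long modifiedTime = 0L;                           // arbitrary epoch origin
        long fetchTime = daysSinceModified * DAY * 1000L; // millis
        long interval = intervalSec;

        long delta = (fetchTime - modifiedTime) / 1000L;  // seconds since change
        if (delta > interval) interval = delta;
        long refTime = fetchTime - Math.round(delta * syncDeltaRate * 1000);

        if (interval < minIntervalSec) interval = minIntervalSec;
        else if (interval > maxIntervalSec) interval = maxIntervalSec;

        long nextFetch = refTime + Math.round(interval * 1000.0);
        return (nextFetch - fetchTime) / (DAY * 1000.0);
    }

    public static void main(String[] args) {
        // 30 days unmodified, 7-day interval, rate 0.3, min 1 day, max 7 days:
        // refTime moves 9 days back, the clamped 7-day interval only adds 7,
        // so the next fetch is scheduled 2 days in the past.
        System.out.println(nextFetchOffsetDays(30, 7 * 86400L, 0.3,
                                               86400L, 7 * 86400L)); // -2.0
    }
}
```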
Lucene SOLR Revolution Dublin
Hi guys, anyone going to http://www.lucenerevolution.org/ next week? I'll be giving a talk on Nutch; ping me on twitter or email if you want to meet up and have a chat.

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
[jira] [Updated] (NUTCH-1564) AdaptiveFetchSchedule: sync_delta forces immediate refetch for documents not modified
[ https://issues.apache.org/jira/browse/NUTCH-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1564:
-----------------------------------

Description:
In a continuous crawl with adaptive fetch scheduling, documents not modified for a longer time may be fetched in every cycle. A continuous crawl is run daily with 3 cycles and the following scheduling intervals (freshness matters):
{code}
db.fetch.schedule.class = org.apache.nutch.crawl.AdaptiveFetchSchedule
db.fetch.schedule.adaptive.sync_delta = true (default)
db.fetch.schedule.adaptive.sync_delta_rate = 0.3 (default)
db.fetch.interval.default = 172800 (2 days)
db.fetch.schedule.adaptive.min_interval = 86400 (1 day)
db.fetch.schedule.adaptive.max_interval = 604800 (7 days)
db.fetch.interval.max = 604800 (7 days)
{code}
At Apr 18 a URL is generated and fetched (from segment dump):
{code}
Crawl Generate::
  Status: 2 (db_fetched)
  Fetch time: Mon Apr 15 19:43:22 CEST 2013
  Modified time: Tue Mar 19 01:07:42 CET 2013
  Retries since fetch: 0
  Retry interval: 604800 seconds (7 days)
Crawl Fetch::
  Status: 33 (fetch_success)
  Fetch time: Thu Apr 18 01:23:51 CEST 2013
  Modified time: Tue Mar 19 01:07:42 CET 2013
  Retries since fetch: 0
  Retry interval: 604800 seconds (7 days)
{code}
Running CrawlDb update results in a next fetch time in the past (which forces an immediate refetch in the next cycle):
{code}
Status: 6 (db_notmodified)
Fetch time: Tue Apr 16 01:37:00 CEST 2013
Modified time: Tue Mar 19 01:07:42 CET 2013
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
{code}
This behavior is caused by the sync_delta calculation in AdaptiveFetchSchedule:
{code}
if (SYNC_DELTA) {
  // try to synchronize with the time of change
  long delta = (fetchTime - modifiedTime) / 1000L;
  if (delta > interval) interval = delta;
  refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
}
if (interval < MIN_INTERVAL) {
  interval = MIN_INTERVAL;
} else if (interval > MAX_INTERVAL) {
  interval = MAX_INTERVAL;
}
...
datum.setFetchTime(refTime + Math.round(interval * 1000.0));
{code}
{{delta}} is 30 days (Apr 18 - Mar 19). {{refTime}} is then 9 days in the past ({{delta}} * 0.3). After adding {{interval}} (adjusted to {{MAX_INTERVAL}} = 7 days) to {{refTime}}, the next fetch would take place 2 days in the past (Apr 16). According to the [javadoc|http://nutch.apache.org/apidocs-2.1/org/apache/nutch/crawl/AdaptiveFetchSchedule.html] (if understood right), there are two aims of the sync_delta if we know that a document hasn't been modified for long:
* increase the fetch interval immediately (not step by step)
* because we expected the document to change within the adaptive interval (but it hasn't), we shift the reference time, i.e. we expect a change soon.
These two aims are somewhat in contradiction. In any case, the next fetch time should always be within the range of (currentFetchTime + MIN_INTERVAL) and (currentFetchTime + MAX_INTERVAL), and never in the past. This problem has been noted by [~pascaldimassimo] in [1|http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/] and [2|http://lucene.472066.n3.nabble.com/Adaptive-sync-with-the-time-of-page-change-td870842.html#a897234].

was:
In a continuous crawl with adaptive fetch scheduling documents not modified for a longer time are may be fetched in every cycle. A continous crawl is run daily with a 3 cycles and the following scheduling intervals (freshness matters): {code} db.fetch.schedule.class = org.apache.nutch.crawl.AdaptiveFetchSchedule db.fetch.schedule.adaptive.sync_delta = true (default) db.fetch.schedule.adaptive.sync_delta_rate = 0.3 (default) db.fetch.interval.default = 172800 (2 days) db.fetch.schedule.adaptive.min_interval = 86400 (1 day) db.fetch.schedule.adaptive.max_interval = 604800 (7 days) db.fetch.interval.max = 604800 (7 days) {code} At Apr 18 a URL is generated and fetched (from segment dump): {code} Crawl Generate:: Status: 2 (db_fetched) Fetch time: Mon Apr 15 19:43:22 CEST 2013 Modified time: Tue Mar 19 01:07:42 CET 2013 Retries since fetch: 0 Retry interval: 604800 seconds (7 days) Crawl Fetch:: Status: 33 (fetch_success) Fetch time: Thu Apr 18 01:23:51 CEST 2013 Modified time: Tue Mar 19 01:07:42 CET 2013 Retries since fetch: 0 Retry interval: 604800 seconds (7 days) {code} Running CrawlDb update results in a next fetch time in the past (which forces an immediate refetch in the next cycle): {code} Status: 6 (db_notmodified) Fetch time: Tue Apr 16 01:37:00 CEST 2013 Modified time: Tue Mar 19 01:07:42 CET 2013 Retries since fetch: 0 Retry interval: 604800 seconds (7 days) {code} This behavior is caused by the sync_delta calculation in AdaptiveFetchSchedule: {code} if (SYNC_DELTA) { // try to synchronize with the time of change long delta = (fetchTime - modifiedTime) / 1000L; if
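The constraint stated in the description — the next fetch time should always lie within (currentFetchTime + MIN_INTERVAL) and (currentFetchTime + MAX_INTERVAL) — could be enforced with a simple clamp. This is a hedged illustration of that idea, not the patch attached to the issue; the method name is hypothetical.

```java
// Possible remedy sketch: clamp the computed next fetch time into
// [fetchTime + MIN_INTERVAL, fetchTime + MAX_INTERVAL] so that sync_delta
// can never schedule a refetch in the past. Times in millis, intervals in
// seconds, matching the conventions of the quoted AdaptiveFetchSchedule code.
public class ClampNextFetch {

    static long clampedNextFetch(long fetchTime, long computedNextFetch,
                                 long minIntervalSec, long maxIntervalSec) {
        long earliest = fetchTime + minIntervalSec * 1000L;
        long latest = fetchTime + maxIntervalSec * 1000L;
        // keep the adaptive value when it already falls inside the window
        return Math.max(earliest, Math.min(latest, computedNextFetch));
    }

    public static void main(String[] args) {
        long day = 86_400_000L;
        long fetchTime = 100 * day;
        // sync_delta produced a time 2 days in the past; the clamp moves it
        // forward to fetchTime + MIN_INTERVAL (1 day ahead)
        System.out.println(clampedNextFetch(fetchTime, fetchTime - 2 * day,
                86400L, 7 * 86400L) == fetchTime + day); // true
    }
}
```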
RE: Lucene SOLR Revolution Dublin
Won't be there as usual, but I'll be sure to check out your talk if it's available! :) cheers mate

-----Original message-----
From: Julien Nioche lists.digitalpeb...@gmail.com
Sent: Tuesday 29th October 2013 16:27
To: u...@nutch.apache.org; dev@nutch.apache.org
Subject: Lucene SOLR Revolution Dublin

Hi guys, anyone going to http://www.lucenerevolution.org/ next week? I'll be giving a talk on Nutch; ping me on twitter or email if you want to meet up and have a chat.

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble