[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roberto Gardenier updated NUTCH-585:
Comment: was deleted (was: I have compiled Nutch 1.5.1 with the provided plugin and used the configuration as described above, all without success. Could anyone assist me with troubleshooting? Nutch crawls and Solr indexes successfully, but the content field still includes content that is supposed to be blacklisted.
Steps:
1. Patched Nutch 1.5.1 with the above blacklist_whitelist_plugin.patch.
2. Enabled the plugin in nutch-default.xml, plugin.includes: index-blacklist-whitelist.
3. Added the new field strippedContent to schema.xml (both Nutch and Solr): <!-- fields for the blacklist/whitelist plugin --> <field name="strippedContent" type="text" stored="true" indexed="true"/>.
4. Configured parser.html.blacklist to blacklist div.kruimelspoor in nutch-default.xml.
I pointed Nutch at my site and fired it. I don't get warnings/errors or any kind of showstoppers, the crawling goes well and the index is filled - but still with everything inside div.kruimelspoor.)

[PARSE-HTML plugin] Block certain parts of HTML code from being indexed
---
Key: NUTCH-585
URL: https://issues.apache.org/jira/browse/NUTCH-585
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Environment: All operating systems
Reporter: Andrea Spinelli
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.6
Attachments: blacklist_whitelist_plugin.patch, nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch

We are using Nutch to index our own web sites; we would like not to index certain parts of our pages, because we know they are not relevant (for instance, there are several links to change the background color) and generate spurious matches. We have modified the plugin so that it ignores HTML code between certain HTML comments, like <!-- START-IGNORE --> ... ignored part ... <!-- STOP-IGNORE -->. We feel this might be useful to someone else, maybe factorizing the comment strings as constants in the configuration files (say parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml). We are almost ready to contribute our code snippet. Looking forward to any expression of interest - or to an explanation of why what we are doing is plain wrong!
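As a rough illustration of the comment-marker approach described in this issue (not the attached patches), a parse filter could drop the marked blocks with a simple regular expression before text extraction; the class name and hard-coded markers below are assumptions, with the marker strings meant to come from parser.html.ignore.start / parser.html.ignore.stop:
{code}
import java.util.regex.Pattern;

// Strips everything between configurable start/stop HTML comments so that the
// removed markup never reaches the indexed content.
public class IgnoreBlockStripperSketch {

  private final Pattern blockPattern;

  public IgnoreBlockStripperSketch(String startMarker, String stopMarker) {
    // DOTALL: an ignored block may span several lines; quote the markers literally.
    blockPattern = Pattern.compile(
        Pattern.quote(startMarker) + ".*?" + Pattern.quote(stopMarker), Pattern.DOTALL);
  }

  public String strip(String html) {
    return blockPattern.matcher(html).replaceAll("");
  }

  public static void main(String[] args) {
    IgnoreBlockStripperSketch stripper = new IgnoreBlockStripperSketch(
        "<!-- START-IGNORE -->", "<!-- STOP-IGNORE -->");
    String page = "<p>relevant</p><!-- START-IGNORE --><a href=\"#blue\">background color</a><!-- STOP-IGNORE --><p>also relevant</p>";
    System.out.println(stripper.strip(page)); // prints only the two relevant paragraphs
  }
}
{code}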
[jira] [Created] (NUTCH-1482) Rename HTMLParseFilter
Julien Nioche created NUTCH-1482:
Summary: Rename HTMLParseFilter
Key: NUTCH-1482
URL: https://issues.apache.org/jira/browse/NUTCH-1482
Project: Nutch
Issue Type: Task
Components: parser
Affects Versions: 1.5.1
Reporter: Julien Nioche

See NUTCH-861 for a background discussion. We have changed the name in 2.x to better reflect what it does, and I think we should do the same for 1.x. Any objections?
[jira] [Commented] (NUTCH-1482) Rename HTMLParseFilter
[ https://issues.apache.org/jira/browse/NUTCH-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13486106#comment-13486106 ]

Lewis John McGibbney commented on NUTCH-1482:
Hi Julien. +1 for this.
[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1245:
Attachment: NUTCH-1245-578-TEST-1.patch

JUnit test to catch this problem and NUTCH-578: a large patch for a test, but the idea is to extend it to also test other transitions of CrawlDatum states.

URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
---
Key: NUTCH-1245
URL: https://issues.apache.org/jira/browse/NUTCH-1245
Project: Nutch
Issue Type: Bug
Affects Versions: 1.4, 1.5
Reporter: Sebastian Nagel
Priority: Critical
Fix For: 1.6
Attachments: NUTCH-1245-578-TEST-1.patch

A document gone with 404 after db.fetch.interval.max (90 days) has passed is fetched over and over again: although the fetch status is fetch_gone, its status in CrawlDb stays db_unfetched. Consequently, this document will be generated and fetched in every cycle from now on. To reproduce:
# create a CrawlDatum in CrawlDb whose retry interval hits db.fetch.interval.max (I manipulated shouldFetch() in AbstractFetchSchedule to achieve this)
# now this URL is fetched again
# but when updating CrawlDb with the fetch_gone, the CrawlDatum is reset to db_unfetched and the retry interval is set to 0.9 * db.fetch.interval.max (81 days)
# this does not change with further generate-fetch-update cycles; here for two segments:
{noformat}
/tmp/testcrawl/segments/20120105161430
SegmentReader: get 'http://localhost/page_gone'
Crawl Generate::
Status: 1 (db_unfetched)
Fetch time: Thu Jan 05 16:14:21 CET 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Metadata: _ngt_: 1325776461784 _pst_: notfound(14), lastModified=0: http://localhost/page_gone

Crawl Fetch::
Status: 37 (fetch_gone)
Fetch time: Thu Jan 05 16:14:48 CET 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Metadata: _ngt_: 1325776461784 _pst_: notfound(14), lastModified=0: http://localhost/page_gone

/tmp/testcrawl/segments/20120105161631
SegmentReader: get 'http://localhost/page_gone'
Crawl Generate::
Status: 1 (db_unfetched)
Fetch time: Thu Jan 05 16:16:23 CET 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Metadata: _ngt_: 1325776583451 _pst_: notfound(14), lastModified=0: http://localhost/page_gone

Crawl Fetch::
Status: 37 (fetch_gone)
Fetch time: Thu Jan 05 16:20:05 CET 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Metadata: _ngt_: 1325776583451 _pst_: notfound(14), lastModified=0: http://localhost/page_gone
{noformat}
As far as I can see it's caused by setPageGoneSchedule() in AbstractFetchSchedule.
Some pseudo-code:
{code}
setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
  datum.fetchInterval = 1.5 * datum.fetchInterval   // now 1.5 * 0.9 * maxInterval
  datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516
  if (maxInterval < datum.fetchInterval)            // necessarily true
    forceRefetch()

forceRefetch:
  if (datum.fetchInterval > maxInterval)            // true because it's 1.35 * maxInterval
    datum.fetchInterval = 0.9 * maxInterval
  datum.status = db_unfetched

shouldFetch (called from generate / Generator.map):
  if ((datum.fetchTime - curTime) > maxInterval)    // always true if the crawler is launched
                                                    // in short intervals (lower than 0.35 * maxInterval)
    datum.fetchTime = curTime                       // forces a refetch
{code}
After setPageGoneSchedule is called via update, the state is db_unfetched and the retry interval is 0.9 * db.fetch.interval.max (81 days). Although the fetch time in the CrawlDb is far in the future
{noformat}
% nutch readdb testcrawl/crawldb -url http://localhost/page_gone
URL: http://localhost/page_gone
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sun May 06 05:20:05 CEST 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Score: 1.0
Signature: null
Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
{noformat}
the URL is generated again because (fetch time - current time) is larger than db.fetch.interval.max. The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35 times db.fetch.interval.max, and the fetch time is always close to current time + 1.35 * db.fetch.interval.max. It's possibly a side effect of NUTCH-516, and may be related to NUTCH-578.
[jira] [Assigned] (NUTCH-1370) Expose exact number of urls injected @runtime
[ https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney reassigned NUTCH-1370:
Assignee: Lewis John McGibbney

Expose exact number of urls injected @runtime
---
Key: NUTCH-1370
URL: https://issues.apache.org/jira/browse/NUTCH-1370
Project: Nutch
Issue Type: Improvement
Components: injector
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
Fix For: 1.6, 2.2

Example: When using trunk, currently we see
{code}
2012-05-22 09:04:00,239 INFO crawl.Injector - Injector: starting at 2012-05-22 09:04:00
2012-05-22 09:04:00,239 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
2012-05-22 09:04:00,239 INFO crawl.Injector - Injector: urlDir: urls
2012-05-22 09:04:00,253 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2012-05-22 09:04:00,955 INFO plugin.PluginRepository - Plugins: looking in:
{code}
I would like to see
{code}
2012-05-22 09:04:00,239 INFO crawl.Injector - Injector: starting at 2012-05-22 09:04:00
2012-05-22 09:04:00,239 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
2012-05-22 09:04:00,239 INFO crawl.Injector - Injector: urlDir: urls
2012-05-22 09:04:00,253 INFO crawl.Injector - Injector: Injected N urls to crawl/crawldb
2012-05-22 09:04:00,253 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2012-05-22 09:04:00,955 INFO plugin.PluginRepository - Plugins: looking in:
{code}
This would make debugging easier and would help those who end up getting
{code}
2012-05-22 09:04:04,850 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
{code}
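A minimal sketch of one way such a count could be produced, assuming the 1.x Injector keeps using the old org.apache.hadoop.mapred API; the mapper, counter group and counter names below are illustrative assumptions, not the actual Injector code:
{code}
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper fragment: count every seed line that survives the basic
// checks, so the driver can later log "Injector: Injected N urls to <crawlDb>".
public class InjectUrlCounterSketch extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
      Reporter reporter) throws IOException {
    String url = value.toString().trim();
    if (url.isEmpty() || url.startsWith("#")) {
      reporter.incrCounter("injector", "urls_rejected", 1); // hypothetical counter
      return;
    }
    reporter.incrCounter("injector", "urls_injected", 1);   // hypothetical counter
    output.collect(new Text(url), value);
  }
}
{code}
After JobClient.runJob(job) returns, the driver could read the counter back via RunningJob.getCounters().findCounter("injector", "urls_injected") and emit the extra log line shown above.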
Re: NUTCH-1370
In addition to this: can someone please explain why [0] StorageUtils#getDataStoreClass is a private method in this class? The reason I ask is that it would be nice to be able to log which Gora class is being used to persist the injected URLs. Are there any security risks associated with making this method public and accessible?

Thanks

Lewis

[0] https://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/storage/StorageUtils.java

On Mon, Oct 29, 2012 at 4:22 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:
Hi,
So I thought I'd take this one on tonight and see if I can resolve it. Basically, my high-level question is as follows... Is each line of a text file (seed file) which we attempt to inject into the webdb considered as an individual map task?
The idea is to establish a counter for the successfully injected URLs (and possibly a counter for unsuccessful ones as well) so that the number of URLs that are (or should be) present within the webdb can be determined after bootstrapping Nutch via the inject command.
Thanks all
Lewis

--
Lewis
Re: NUTCH-1370
Hi Lewis, see comments below.

> So I thought I'd take this one on tonight and see if I can resolve it. Basically, my high-level question is as follows... Is each line of a text file (seed file) which we attempt to inject into the webdb considered as an individual map task?

No - each file is a map task.

> The idea is to establish a counter for the successfully injected URLs (and possibly a counter for unsuccessful ones as well) so that the number of URLs that are (or should be) present within the webdb can be determined after bootstrapping Nutch via the inject command.

You get this information from the Hadoop MapReduce admin: the number of seeds is the Map input records of the first job, the number after filtering and normalisation is in its Map output records, and the final number of urls in the crawldb, after merging with whatever was already there, is in the Reduce output records. Just get the values from the counters of these 2 jobs to display a user-friendly message in the log.

In general I would advise anyone to use the pseudo-distributed mode instead of the local one, as you get a lot more info from the Hadoop admin screen and won't have to trawl through the log files.

HTH

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
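For illustration, a rough sketch of reading those built-in counters from the driver after the jobs complete, using the old mapred API; the Task$Counter group and counter names are the Hadoop 1.x built-ins, but treat the exact lookup strings as an assumption to verify against your Hadoop version:
{code}
import java.io.IOException;
import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.RunningJob;

// Pull the standard task counters Julien mentions and turn them into a
// user-friendly log line; pass in the RunningJob handles of the two inject jobs.
public class InjectCounterReportSketch {

  public static void report(RunningJob convertJob, RunningJob mergeJob) throws IOException {
    Counters c1 = convertJob.getCounters();
    Counters c2 = mergeJob.getCounters();
    long seeds = c1.findCounter("org.apache.hadoop.mapred.Task$Counter",
        "MAP_INPUT_RECORDS").getCounter();
    long kept = c1.findCounter("org.apache.hadoop.mapred.Task$Counter",
        "MAP_OUTPUT_RECORDS").getCounter();
    long total = c2.findCounter("org.apache.hadoop.mapred.Task$Counter",
        "REDUCE_OUTPUT_RECORDS").getCounter();
    System.out.println("Injector: read " + seeds + " seed urls, kept " + kept
        + " after filtering/normalisation, crawldb now holds " + total + " urls");
  }
}
{code}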
Re: NUTCH-1370
Hi Julien,

Thanks for the comments. Any additional ones regarding the accessibility of the getDataStoreClass?

Thanks again

Lewis

--
Lewis
[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1245:
Attachment: NUTCH-1245-1.patch

FetchSchedule.setPageGoneSchedule is called exclusively for a fetch_gone in CrawlDbReducer.reduce. Is there a need to call forceRefetch just after a fetch leads to a fetch_gone (assumed there is little delay between fetch and updatedb)? Attached patch sets the fetchInterval to db.fetch.interval.max and does not call forceRefetch.
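A minimal sketch of the behaviour this comment describes (pin the interval at db.fetch.interval.max instead of calling forceRefetch); the stub class and field names are simplifications and assumptions, not the actual AbstractFetchSchedule code or the attached patch:
{code}
// Illustrative only: reschedules the gone page far in the future instead of
// resetting it to db_unfetched via forceRefetch().
public class PageGoneScheduleSketch {

  private long maxInterval = 90L * 24 * 60 * 60; // db.fetch.interval.max in seconds

  static class CrawlDatumStub {
    long fetchTime;     // milliseconds since epoch
    long fetchInterval; // seconds
  }

  public void setPageGoneSchedule(CrawlDatumStub datum, long fetchTime) {
    // instead of growing the interval and then calling forceRefetch(), pin the
    // interval at db.fetch.interval.max so the page keeps its gone status and
    // is not generated again until that interval has passed
    datum.fetchInterval = maxInterval;
    datum.fetchTime = fetchTime + datum.fetchInterval * 1000L;
  }
}
{code}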
[jira] [Commented] (NUTCH-1482) Rename HTMLParseFilter
[ https://issues.apache.org/jira/browse/NUTCH-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13486155#comment-13486155 ]

Markus Jelsma commented on NUTCH-1482:
+0 I'm fine with such a change, but this will break all (custom) plugins. I am not sure some users will be very happy with this change, as all references to it in plugin.xml and the classes must be updated.
[jira] [Commented] (NUTCH-1482) Rename HTMLParseFilter
[ https://issues.apache.org/jira/browse/NUTCH-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13486290#comment-13486290 ]

Sebastian Nagel commented on NUTCH-1482:
Markus, you are right: I remember the API change of HTMLParseFilter in 1.0; it took me some hours to get the custom plugins compiled.
- Is it possible to deprecate the extension point and keep it for some time?
- At least, place a warning in CHANGES.txt with a link to update instructions in the wiki.
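One possible shape of the deprecation Sebastian suggests: keep the old extension-point interface compiling as an alias of the new name (ParseFilter, as in 2.x). The method signature here is a placeholder assumption, not the committed change:
{code}
import org.w3c.dom.DocumentFragment;

// New extension point name, matching 2.x.
interface ParseFilter {
  Object filter(String url, DocumentFragment doc, Object parseResult); // placeholder signature
}

// Old name kept for a release or two so existing plugin.xml entries and
// custom plugins keep working while users migrate.
@Deprecated
interface HTMLParseFilter extends ParseFilter {
}
{code}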
[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1245:
Attachment: NUTCH-1245-2.patch
            NUTCH-1245-578-TEST-2.patch

Improved patches
[jira] [Commented] (NUTCH-578) URL fetched with 403 is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13486484#comment-13486484 ]

Sebastian Nagel commented on NUTCH-578:
NUTCH-1245 provides a test to catch this problem. Attached v5 patch:
* call setPageGoneSchedule in CrawlDbReducer.reduce when the retry counter is hit and the status is set to db_gone. All attached patches do this: it sets the fetchInterval to a value larger than one day, so that from now on the URL is not fetched again and again.
* reset the retry counter in setPageGoneSchedule so that it cannot overflow and to again get 3 retries after db.fetch.interval.max is reached.

URL fetched with 403 is generated over and over again
---
Key: NUTCH-578
URL: https://issues.apache.org/jira/browse/NUTCH-578
Project: Nutch
Issue Type: Bug
Components: generator
Affects Versions: 1.0.0
Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I have checked out the most recent version of the trunk as of Nov 20, 2007.
Reporter: Nathaniel Powell
Assignee: Markus Jelsma
Fix For: 1.6
Attachments: crawl-urlfilter.txt, NUTCH-578.patch, NUTCH-578_v2.patch, NUTCH-578_v3.patch, NUTCH-578_v4.patch, NUTCH-578_v5.patch, nutch-site.xml, regex-normalize.xml, urls.txt

I have not changed the following parameter in nutch-default.xml:
<property>
  <name>db.fetch.retry.max</name>
  <value>3</value>
  <description>The maximum number of times a url that has encountered recoverable errors is generated for fetch.</description>
</property>
However, there is a URL on the site that I'm crawling, www.teachertube.com, which keeps being generated over and over again for almost every segment (many more times than 3):
fetch of http://www.teachertube.com/images/ failed with: Http code=403, url=http://www.teachertube.com/images/
This is a bug, right? Thanks.
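To make the v5 behaviour concrete, a rough sketch of the update-side logic under the assumptions above; the stub type, method names and property values stand in for the real CrawlDatum, CrawlDbReducer and configuration, and this is not the attached patch:
{code}
// Illustrative update-step logic: once db.fetch.retry.max is exceeded the page
// is treated as gone, rescheduled far in the future, and its retry counter is
// reset so it gets a fresh set of retries after db.fetch.interval.max has passed.
public class RetryGoneSketch {

  private int retryMax = 3;                      // db.fetch.retry.max
  private long maxInterval = 90L * 24 * 60 * 60; // db.fetch.interval.max in seconds

  static class CrawlDatumStub {
    int retries;
    long fetchTime;     // milliseconds since epoch
    long fetchInterval; // seconds
    String status = "db_unfetched";
  }

  // Called when a fetch came back with a transient error such as HTTP 403.
  public void onRetryStatus(CrawlDatumStub datum, long fetchTime) {
    datum.retries++;
    if (datum.retries > retryMax) {
      datum.status = "db_gone";
      setPageGoneSchedule(datum, fetchTime);
    }
  }

  public void setPageGoneSchedule(CrawlDatumStub datum, long fetchTime) {
    datum.fetchInterval = Math.min((long) (datum.fetchInterval * 1.5f), maxInterval);
    datum.fetchTime = fetchTime + datum.fetchInterval * 1000L;
    datum.retries = 0; // reset so the counter cannot overflow
  }
}
{code}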
[jira] [Updated] (NUTCH-578) URL fetched with 403 is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-578:
Attachment: NUTCH-578_v5.patch