[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2012-10-29 Thread Roberto Gardenier (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roberto Gardenier updated NUTCH-585:


Comment: was deleted

(was: I have compiled Nutch 1.5.1 with the provided plugin and used the 
configuration as described above, all without success. 
Could anyone assist me with troubleshooting?

Nutch crawls and Solr indexes successfully, but the content field still includes 
content that is supposed to be blacklisted.

Steps:
1. Patched Nutch 1.5.1 with the blacklist_whitelist_plugin.patch above.
2. Enabled the plugin in nutch-default.xml by adding index-blacklist-whitelist to 
plugin.includes.
3. Added the new field strippedContent to schema.xml (both Nutch and Solr): 
<!-- fields for the blacklist/whitelist plugin --> 
<field name="strippedContent" type="text" stored="true" indexed="true"/>.
4. Configured parser.html.blacklist to blacklist div.kruimelspoor in 
nutch-default.xml.

I pointed Nutch at my site and fired it off. I don't get warnings, errors or any 
kind of showstoppers; the crawl goes well and the index is filled - but still with 
everything inside div.kruimelspoor.
)
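
For reference, the configuration described in the comment above would look roughly 
like the following nutch-site.xml and schema.xml fragments (a sketch only; the 
plugin list shown is illustrative, and the div.kruimelspoor / strippedContent names 
are taken from the comment, not from the plugin documentation):

{code}
<!-- nutch-site.xml -->
<property>
  <name>plugin.includes</name>
  <!-- index-blacklist-whitelist appended to the existing plugin list -->
  <value>...|index-blacklist-whitelist</value>
</property>
<property>
  <name>parser.html.blacklist</name>
  <value>div.kruimelspoor</value>
</property>

<!-- schema.xml (both Nutch and Solr), per step 3 above -->
<!-- fields for the blacklist/whitelist plugin -->
<field name="strippedContent" type="text" stored="true" indexed="true"/>
{code}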

 [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
 ---

 Key: NUTCH-585
 URL: https://issues.apache.org/jira/browse/NUTCH-585
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: All operating systems
Reporter: Andrea Spinelli
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: blacklist_whitelist_plugin.patch, 
 nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch


 We are using nutch to index our own web sites; we would like not to index 
 certain parts of our pages, because we know they are not relevant (for 
 instance, there are several links to change the background color) and 
 generate spurious matches.
 We have modified the plugin so that it ignores HTML code between certain HTML 
 comments, like
 <!-- START-IGNORE -->
 ... ignored part ...
 <!-- STOP-IGNORE -->
 We feel this might be useful to someone else, perhaps with the comment strings 
 factored out as constants in the configuration files (say parser.html.ignore.start 
 and parser.html.ignore.stop in nutch-site.xml).
 We are almost ready to contribute our code snippet. Looking forward to any 
 expression of interest - or to an explanation of why what we are doing is 
 plain wrong!
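
 A minimal sketch of what such a filter could do (illustrative only, not the 
 attached patches; the marker strings would come from parser.html.ignore.start/stop 
 as suggested above):
 {code}
 // Drops everything between the START-IGNORE / STOP-IGNORE comments before indexing.
 public class IgnoreMarkerStripper {
   private static final String START = "<!-- START-IGNORE -->";
   private static final String STOP  = "<!-- STOP-IGNORE -->";

   public static String strip(String html) {
     StringBuilder out = new StringBuilder(html.length());
     int pos = 0;
     while (true) {
       int start = html.indexOf(START, pos);
       if (start < 0) {                      // no more ignored sections
         out.append(html, pos, html.length());
         return out.toString();
       }
       out.append(html, pos, start);         // keep everything before the marker
       int stop = html.indexOf(STOP, start);
       if (stop < 0) {                       // unbalanced marker: drop the rest
         return out.toString();
       }
       pos = stop + STOP.length();           // resume after the closing marker
     }
   }
 }
 {code}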

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (NUTCH-1482) Rename HTMLParseFilter

2012-10-29 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-1482:


 Summary: Rename HTMLParseFilter
 Key: NUTCH-1482
 URL: https://issues.apache.org/jira/browse/NUTCH-1482
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.5.1
Reporter: Julien Nioche


See NUTCH-861 for a background discussion. We have changed the name in 2.x to 
better reflect what it does and I think we should do the same for 1.x.

any objections?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1482) Rename HTMLParseFilter

2012-10-29 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486106#comment-13486106
 ] 

Lewis John McGibbney commented on NUTCH-1482:
-

Hi Julien. +1 for this

 Rename HTMLParseFilter
 --

 Key: NUTCH-1482
 URL: https://issues.apache.org/jira/browse/NUTCH-1482
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.5.1
Reporter: Julien Nioche

 See NUTCH-861 for a background discussion. We have changed the name in 2.x to 
 better reflect what it does and I think we should do the same for 1.x.
 any objections?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1245:
---

Attachment: NUTCH-1245-578-TEST-1.patch

JUnit test to catch this problem and NUTCH-578: a large patch for a test, but the 
idea is to extend it to also test other transitions of CrawlDatum states.

 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb 
 and is generated over and over again
 

 Key: NUTCH-1245
 URL: https://issues.apache.org/jira/browse/NUTCH-1245
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4, 1.5
Reporter: Sebastian Nagel
Priority: Critical
 Fix For: 1.6

 Attachments: NUTCH-1245-578-TEST-1.patch


 A document gone with 404 after db.fetch.interval.max (90 days) has passed
 is fetched over and over again: although the fetch status is fetch_gone,
 its status in CrawlDb stays db_unfetched. Consequently, this document will
 be generated and fetched in every cycle from now on.
 To reproduce:
 # create a CrawlDatum in CrawlDb whose retry interval hits 
 db.fetch.interval.max (I manipulated shouldFetch() in 
 AbstractFetchSchedule to achieve this)
 # now this URL is fetched again
 # but when updating CrawlDb with the fetch_gone the CrawlDatum is reset to 
 db_unfetched and the retry interval is fixed to 0.9 * db.fetch.interval.max (81 
 days)
 # this does not change over successive generate-fetch-update cycles, here for two 
 segments:
 {noformat}
 /tmp/testcrawl/segments/20120105161430
 SegmentReader: get 'http://localhost/page_gone'
 Crawl Generate::
 Status: 1 (db_unfetched)
 Fetch time: Thu Jan 05 16:14:21 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 Crawl Fetch::
 Status: 37 (fetch_gone)
 Fetch time: Thu Jan 05 16:14:48 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 /tmp/testcrawl/segments/20120105161631
 SegmentReader: get 'http://localhost/page_gone'
 Crawl Generate::
 Status: 1 (db_unfetched)
 Fetch time: Thu Jan 05 16:16:23 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 Crawl Fetch::
 Status: 37 (fetch_gone)
 Fetch time: Thu Jan 05 16:20:05 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 {noformat}
 As far as I can see it's caused by setPageGoneSchedule() in 
 AbstractFetchSchedule. Some pseudo-code:
 {code}
 setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
   datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * maxInterval
   datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516
   if (maxInterval < datum.fetchInterval) // necessarily true
     forceRefetch()
 forceRefetch:
   if (datum.fetchInterval > maxInterval) // true because it's 1.35 * maxInterval
     datum.fetchInterval = 0.9 * maxInterval
   datum.status = db_unfetched
 shouldFetch (called from generate / Generator.map):
   if ((datum.fetchTime - curTime) > maxInterval)
     // always true if the crawler is launched in short intervals
     // (lower than 0.35 * maxInterval)
     datum.fetchTime = curTime // forces a refetch
 {code}
 After setPageGoneSchedule is called via update the state is db_unfetched and 
 the retry interval 0.9 * db.fetch.interval.max (81 days). 
 Although the fetch time in the CrawlDb is far in the future
 {noformat}
 % nutch readdb testcrawl/crawldb -url http://localhost/page_gone
 URL: http://localhost/page_gone
 Version: 7
 Status: 1 (db_unfetched)
 Fetch time: Sun May 06 05:20:05 CEST 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Score: 1.0
 Signature: null
 Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
 {noformat}
 the URL is generated again because (fetch time - current time) is larger than 
 db.fetch.interval.max.
 The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35 times 
 db.fetch.interval.max, and the fetch time is always close to current time + 1.35 * 
 db.fetch.interval.max.
 It's possibly a side effect of NUTCH-516, and may be related to NUTCH-578.
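
 A standalone simulation of the arithmetic above (not the Nutch classes themselves; 
 a 90-day db.fetch.interval.max is assumed) makes the oscillation visible:
 {code}
 public class GoneScheduleOscillation {
   public static void main(String[] args) {
     final double maxInterval = 90.0;            // db.fetch.interval.max, in days
     double fetchInterval = 0.9 * maxInterval;   // 81 days: state after the first forceRefetch

     for (int cycle = 1; cycle <= 3; cycle++) {
       fetchInterval = 1.5 * fetchInterval;      // setPageGoneSchedule: 121.5 days = 1.35 * maxInterval
       System.out.printf("cycle %d: after setPageGoneSchedule = %.1f days%n", cycle, fetchInterval);
       if (fetchInterval > maxInterval) {        // forceRefetch resets the interval ...
         fetchInterval = 0.9 * maxInterval;      // ... back to 81 days (6998400 seconds)
       }
       System.out.printf("cycle %d: after forceRefetch        = %.1f days%n", cycle, fetchInterval);
     }
   }
 }
 {code}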

--
This message is automatically generated by JIRA.

[jira] [Assigned] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-10-29 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-1370:
---

Assignee: Lewis John McGibbney

 Expose exact number of urls injected @runtime 
 --

 Key: NUTCH-1370
 URL: https://issues.apache.org/jira/browse/NUTCH-1370
 Project: Nutch
  Issue Type: Improvement
  Components: injector
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6, 2.2


 Example: When using trunk, currently we see 
 {code}
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: starting at 
 2012-05-22 09:04:00
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: crawlDb: 
 crawl/crawldb
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: urlDir: urls
 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Converting injected 
 urls to crawl db entries.
 2012-05-22 09:04:00,955 INFO  plugin.PluginRepository - Plugins: looking in:
 {code}
 I would like to see
 {code}
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: starting at 
 2012-05-22 09:04:00
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: crawlDb: 
 crawl/crawldb
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: urlDir: urls
 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Injected N urls to 
 crawl/crawldb
 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Converting injected 
 urls to crawl db entries.
 2012-05-22 09:04:00,955 INFO  plugin.PluginRepository - Plugins: looking in:
 {code}
 This would make debugging easier and would help those who end up getting 
 {code}
 2012-05-22 09:04:04,850 WARN  crawl.Generator - Generator: 0 records selected 
 for fetching, exiting ...
 {code}
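
 A rough sketch of how such a count could be produced with a Hadoop counter 
 (illustrative only; the class, group and counter names are made up for this 
 example and are not the actual Injector code):
 {code}
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;

 public class InjectCounterSketch {
   static final String GROUP = "injector", COUNTER = "urls_injected";

   public static class InjectMapper extends Mapper<LongWritable, Text, Text, Text> {
     @Override
     protected void map(LongWritable key, Text line, Context context)
         throws java.io.IOException, InterruptedException {
       String url = line.toString().trim();
       if (url.isEmpty() || url.startsWith("#")) return;  // skip blanks and comments
       // ... URL filtering / normalisation would happen here ...
       context.getCounter(GROUP, COUNTER).increment(1);
       context.write(new Text(url), new Text());
     }
   }

   // Called after job.waitForCompletion(true) to emit the requested log line.
   public static void logInjected(Job job) throws java.io.IOException {
     long n = job.getCounters().findCounter(GROUP, COUNTER).getValue();
     System.out.println("Injector: Injected " + n + " urls to crawl/crawldb");
   }
 }
 {code}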

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: NUTCH-1370

2012-10-29 Thread Lewis John Mcgibbney
In addition to this, can someone please explain why
StorageUtils#getDataStoreClass [0] is a private method in this class? The
reason I ask is that it would be nice to be able to log which Gora
class is being used to persist the injected URLs.

Are there any security risks associated with making this method public
and accessible?

Thanks

Lewis

[0] 
https://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/storage/StorageUtils.java

On Mon, Oct 29, 2012 at 4:22 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
 Hi,

 So I thought I'd take this one on tonight and see if I can resolve.
 Basically, my high level question is as follows...
 Is each line of a text file (seed file) which we attempt to inject
 into the webdb considered as an individual map task?
 The idea is to establish a counter for the successfully injected URLs
 (and possibly a counter for unsuccessful ones as well), so that the number
 of URLs that are (or should be) present within the webdb can be determined
 after bootstrapping Nutch via the inject command.

 Thanks all

 Lewis

 --
 Lewis



-- 
Lewis


Re: NUTCH-1370

2012-10-29 Thread Julien Nioche
Hi Lewis

see comments below


 So I thought I'd take this one on tonight and see if I can resolve.
 Basically, my high level question is as follows...
 Is each line of a text file (seed file) which we attempt to inject
 into the webdb considered as an individual map task?


no - each file is a map task


 The idea is to establish a counter for the successfully injected URLs
 (and possibly a counter for unsuccessful ones as well), so that the number
 of URLs that are (or should be) present within the webdb can be determined
 after bootstrapping Nutch via the inject command.

 you get this information from the Hadoop MapReduce admin - the number of
seeds is the Map input records of the first job, the number after filtering
and normalisation is in the Map output records, and the final number of urls
in the crawldb after merging with whatever was already there is in the Reduce
output records.

Just get the values from the counters of these 2 jobs to display a
user-friendly message in the log.
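
Something like this, for example (plain Hadoop API; injectJob, mergeJob and the
message text are just placeholders for the sketch, not the actual Injector fields):

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.TaskCounter;

    public class InjectorCounterLog {
      public static void logCounts(Job injectJob, Job mergeJob) throws java.io.IOException {
        long seeds    = injectJob.getCounters().findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
        long accepted = injectJob.getCounters().findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();
        long inDb     = mergeJob.getCounters().findCounter(TaskCounter.REDUCE_OUTPUT_RECORDS).getValue();
        System.out.println("Injector: " + seeds + " seed urls, " + accepted
            + " after filtering/normalisation, " + inDb + " urls in the crawldb");
      }
    }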

In general I would advise anyone to use the pseudo distributed mode instead
of the local one as you get a lot more info from the Hadoop admin screen
and won't have to trawl through the log files.

HTH

Julien


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: NUTCH-1370

2012-10-29 Thread Lewis John Mcgibbney
Hi Julien,

Thanks for the comments. Any additional ones regarding the accessibility of
the getDataStoreClass?

Thanks again

Lewis

On Mon, Oct 29, 2012 at 4:52 PM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:

 Hi Lewis

 see comments below


 So I thought I'd take this one on tonight and see if I can resolve.
 Basically, my high level question is as follows...
 Is each line of a text file (seed file) which we attempt to inject
 into the webdb considered as an individual map task?


 no - each file is a map task


 The idea is to establish a counter for the successfully injected URLs
 (and possibly a counter for unsuccessful ones as well), so that the number
 of URLs that are (or should be) present within the webdb can be determined
 after bootstrapping Nutch via the inject command.

 you get this information from the Hadoop MapReduce admin - the number of
 seeds is the Map input records of the first job, the number after
 filtering and normalisation is in the Map output records, and the final
 number of urls in the crawldb after merging with whatever was already there
 is in the Reduce output records.

 Just get the values from the counters of these 2 jobs to display a
 user-friendly message in the log.

 In general I would advise anyone to use the pseudo distributed mode
 instead of the local one as you get a lot more info from the Hadoop admin
 screen and won't have to trawl through the log files.

 HTH

 Julien


 --
 Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble




-- 
*Lewis*


[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1245:
---

Attachment: NUTCH-1245-1.patch

FetchSchedule.setPageGoneSchedule is called exclusively for a fetch_gone in 
CrawlDbReducer.reduce. Is there a need to call forceRefetch just after a fetch 
leads to a fetch_gone (assuming there is little delay between fetch and 
updatedb)?

The attached patch sets the fetchInterval to db.fetch.interval.max and does not 
call forceRefetch.
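
A standalone sketch of that behaviour (simplified stand-in code, not the patch itself):

{code}
// On a gone page: pin the interval at db.fetch.interval.max and push the fetch time
// out by that interval, instead of calling forceRefetch.
public class PatchedGoneScheduleSketch {
  static final long DAY = 24L * 60 * 60;            // seconds
  static final long MAX_INTERVAL = 90 * DAY;        // db.fetch.interval.max (90 days assumed)

  long fetchInterval;                               // stands in for datum.getFetchInterval()
  long fetchTime;                                   // stands in for datum.getFetchTime()

  void setPageGoneSchedule(long curFetchTime) {
    fetchInterval = MAX_INTERVAL;                   // never exceeds the maximum interval
    fetchTime = curFetchTime + fetchInterval;       // next attempt only after 90 days
    // no forceRefetch(): the CrawlDb status is not flipped back to db_unfetched
  }
}
{code}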

 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb 
 and is generated over and over again
 

 Key: NUTCH-1245
 URL: https://issues.apache.org/jira/browse/NUTCH-1245
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4, 1.5
Reporter: Sebastian Nagel
Priority: Critical
 Fix For: 1.6

 Attachments: NUTCH-1245-1.patch, NUTCH-1245-578-TEST-1.patch


 A document gone with 404 after db.fetch.interval.max (90 days) has passed
 is fetched over and over again: although the fetch status is fetch_gone,
 its status in CrawlDb stays db_unfetched. Consequently, this document will
 be generated and fetched in every cycle from now on.
 To reproduce:
 # create a CrawlDatum in CrawlDb whose retry interval hits 
 db.fetch.interval.max (I manipulated shouldFetch() in 
 AbstractFetchSchedule to achieve this)
 # now this URL is fetched again
 # but when updating CrawlDb with the fetch_gone the CrawlDatum is reset to 
 db_unfetched and the retry interval is fixed to 0.9 * db.fetch.interval.max (81 
 days)
 # this does not change over successive generate-fetch-update cycles, here for two 
 segments:
 {noformat}
 /tmp/testcrawl/segments/20120105161430
 SegmentReader: get 'http://localhost/page_gone'
 Crawl Generate::
 Status: 1 (db_unfetched)
 Fetch time: Thu Jan 05 16:14:21 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 Crawl Fetch::
 Status: 37 (fetch_gone)
 Fetch time: Thu Jan 05 16:14:48 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 /tmp/testcrawl/segments/20120105161631
 SegmentReader: get 'http://localhost/page_gone'
 Crawl Generate::
 Status: 1 (db_unfetched)
 Fetch time: Thu Jan 05 16:16:23 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 Crawl Fetch::
 Status: 37 (fetch_gone)
 Fetch time: Thu Jan 05 16:20:05 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 {noformat}
 As far as I can see it's caused by setPageGoneSchedule() in 
 AbstractFetchSchedule. Some pseudo-code:
 {code}
 setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
   datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * maxInterval
   datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516
   if (maxInterval < datum.fetchInterval) // necessarily true
     forceRefetch()
 forceRefetch:
   if (datum.fetchInterval > maxInterval) // true because it's 1.35 * maxInterval
     datum.fetchInterval = 0.9 * maxInterval
   datum.status = db_unfetched
 shouldFetch (called from generate / Generator.map):
   if ((datum.fetchTime - curTime) > maxInterval)
     // always true if the crawler is launched in short intervals
     // (lower than 0.35 * maxInterval)
     datum.fetchTime = curTime // forces a refetch
 {code}
 After setPageGoneSchedule is called via update the state is db_unfetched and 
 the retry interval 0.9 * db.fetch.interval.max (81 days). 
 Although the fetch time in the CrawlDb is far in the future
 {noformat}
 % nutch readdb testcrawl/crawldb -url http://localhost/page_gone
 URL: http://localhost/page_gone
 Version: 7
 Status: 1 (db_unfetched)
 Fetch time: Sun May 06 05:20:05 CEST 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Score: 1.0
 Signature: null
 Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
 {noformat}
 the URL is generated again because (fetch time - current time) is larger than 
 db.fetch.interval.max.
 The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35 times 
 db.fetch.interval.max, and the fetch time is always close to current time + 1.35 * 
 db.fetch.interval.max.
 It's possibly a side effect of NUTCH-516, and may be related to NUTCH-578.

[jira] [Commented] (NUTCH-1482) Rename HTMLParseFilter

2012-10-29 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486155#comment-13486155
 ] 

Markus Jelsma commented on NUTCH-1482:
--

+0 I'm fine with such a change, but it will break all (custom) plugins. I am not 
sure users will be very happy with this change, as all references to it in 
plugin.xml and the classes must be updated.

 Rename HTMLParseFilter
 --

 Key: NUTCH-1482
 URL: https://issues.apache.org/jira/browse/NUTCH-1482
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.5.1
Reporter: Julien Nioche

 See NUTCH-861 for a background discussion. We have changed the name in 2.x to 
 better reflect what it does and I think we should do the same for 1.x.
 any objections?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1482) Rename HTMLParseFilter

2012-10-29 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486290#comment-13486290
 ] 

Sebastian Nagel commented on NUTCH-1482:


Markus, you are right: I remember the API change of HTMLParseFilter in 1.0; it 
took me some hours to get the custom plugins to compile.
- Is it possible to deprecate the extension point and keep it for some time?
- At least, place a warning in CHANGES.txt with a link to update instructions 
in the wiki.

 Rename HTMLParseFilter
 --

 Key: NUTCH-1482
 URL: https://issues.apache.org/jira/browse/NUTCH-1482
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.5.1
Reporter: Julien Nioche

 See NUTCH-861 for a background discussion. We have changed the name in 2.x to 
 better reflect what it does and I think we should do the same for 1.x.
 any objections?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1245:
---

Attachment: NUTCH-1245-2.patch
NUTCH-1245-578-TEST-2.patch

Improved patches

 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb 
 and is generated over and over again
 

 Key: NUTCH-1245
 URL: https://issues.apache.org/jira/browse/NUTCH-1245
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4, 1.5
Reporter: Sebastian Nagel
Priority: Critical
 Fix For: 1.6

 Attachments: NUTCH-1245-1.patch, NUTCH-1245-2.patch, 
 NUTCH-1245-578-TEST-1.patch, NUTCH-1245-578-TEST-2.patch


 A document gone with 404 after db.fetch.interval.max (90 days) has passed
 is fetched over and over again: although the fetch status is fetch_gone,
 its status in CrawlDb stays db_unfetched. Consequently, this document will
 be generated and fetched in every cycle from now on.
 To reproduce:
 # create a CrawlDatum in CrawlDb whose retry interval hits 
 db.fetch.interval.max (I manipulated shouldFetch() in 
 AbstractFetchSchedule to achieve this)
 # now this URL is fetched again
 # but when updating CrawlDb with the fetch_gone the CrawlDatum is reset to 
 db_unfetched and the retry interval is fixed to 0.9 * db.fetch.interval.max (81 
 days)
 # this does not change over successive generate-fetch-update cycles, here for two 
 segments:
 {noformat}
 /tmp/testcrawl/segments/20120105161430
 SegmentReader: get 'http://localhost/page_gone'
 Crawl Generate::
 Status: 1 (db_unfetched)
 Fetch time: Thu Jan 05 16:14:21 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 Crawl Fetch::
 Status: 37 (fetch_gone)
 Fetch time: Thu Jan 05 16:14:48 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 /tmp/testcrawl/segments/20120105161631
 SegmentReader: get 'http://localhost/page_gone'
 Crawl Generate::
 Status: 1 (db_unfetched)
 Fetch time: Thu Jan 05 16:16:23 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 Crawl Fetch::
 Status: 37 (fetch_gone)
 Fetch time: Thu Jan 05 16:20:05 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 {noformat}
 As far as I can see it's caused by setPageGoneSchedule() in 
 AbstractFetchSchedule. Some pseudo-code:
 {code}
 setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
   datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * maxInterval
   datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516
   if (maxInterval < datum.fetchInterval) // necessarily true
     forceRefetch()
 forceRefetch:
   if (datum.fetchInterval > maxInterval) // true because it's 1.35 * maxInterval
     datum.fetchInterval = 0.9 * maxInterval
   datum.status = db_unfetched
 shouldFetch (called from generate / Generator.map):
   if ((datum.fetchTime - curTime) > maxInterval)
     // always true if the crawler is launched in short intervals
     // (lower than 0.35 * maxInterval)
     datum.fetchTime = curTime // forces a refetch
 {code}
 After setPageGoneSchedule is called via update the state is db_unfetched and 
 the retry interval 0.9 * db.fetch.interval.max (81 days). 
 Although the fetch time in the CrawlDb is far in the future
 {noformat}
 % nutch readdb testcrawl/crawldb -url http://localhost/page_gone
 URL: http://localhost/page_gone
 Version: 7
 Status: 1 (db_unfetched)
 Fetch time: Sun May 06 05:20:05 CEST 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Score: 1.0
 Signature: null
 Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
 {noformat}
 the URL is generated again because (fetch time - current time) is larger than 
 db.fetch.interval.max.
 The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35 times 
 db.fetch.interval.max, and the fetch time is always close to current time + 1.35 * 
 db.fetch.interval.max.
 It's possibly a side effect of NUTCH-516, and may be related to NUTCH-578.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-578) URL fetched with 403 is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486484#comment-13486484
 ] 

Sebastian Nagel commented on NUTCH-578:
---

NUTCH-1245 provides a test to catch this problem.

The attached v5 patch:
* calls setPageGoneSchedule in CrawlDbReducer.reduce when the retry counter limit is 
hit and the status is set to db_gone. All attached patches do this: it sets the 
fetchInterval to a value larger than one day, so that from now on the URL is 
not fetched again and again.
* resets the retry counter in setPageGoneSchedule so that it cannot overflow and so 
that there are again 3 retries after db.fetch.interval.max is reached (see the 
sketch below).
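
A standalone sketch of the v5 behaviour (simplified stand-in code, not the patch itself):

{code}
// When a page is gone: keep the interval well above one day and clear the retry
// counter, so there are again db.fetch.retry.max attempts after the interval elapses.
public class GoneRetryResetSketch {
  static final int DAY = 24 * 60 * 60;              // seconds
  static final int MAX_INTERVAL = 90 * DAY;         // db.fetch.interval.max (90 days assumed)

  int fetchInterval;                                // stands in for datum.getFetchInterval()
  int retries;                                      // stands in for datum.getRetriesSinceFetch()

  void setPageGoneSchedule() {
    if (fetchInterval < MAX_INTERVAL) {
      fetchInterval = MAX_INTERVAL;                 // > 1 day: the URL is no longer generated every cycle
    }
    retries = 0;                                    // counter cannot overflow across repeated gone states
  }
}
{code}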

 URL fetched with 403 is generated over and over again
 -

 Key: NUTCH-578
 URL: https://issues.apache.org/jira/browse/NUTCH-578
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.0.0
 Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I 
 have checked out the most recent version of the trunk as of Nov 20, 2007
Reporter: Nathaniel Powell
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: crawl-urlfilter.txt, NUTCH-578.patch, 
 NUTCH-578_v2.patch, NUTCH-578_v3.patch, NUTCH-578_v4.patch, 
 NUTCH-578_v5.patch, nutch-site.xml, regex-normalize.xml, urls.txt


 I have not changed the following parameter in the nutch-default.xml:
 <property>
   <name>db.fetch.retry.max</name>
   <value>3</value>
   <description>The maximum number of times a url that has encountered
   recoverable errors is generated for fetch.</description>
 </property>
 However, there is a URL on the site that I'm crawling, 
 www.teachertube.com, which keeps being generated over and over again for 
 almost every segment (many more than 3 times):
 fetch of http://www.teachertube.com/images/ failed with: Http code=403, 
 url=http://www.teachertube.com/images/
 This is a bug, right?
 Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-578) URL fetched with 403 is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-578:
--

Attachment: NUTCH-578_v5.patch

 URL fetched with 403 is generated over and over again
 -

 Key: NUTCH-578
 URL: https://issues.apache.org/jira/browse/NUTCH-578
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.0.0
 Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I 
 have checked out the most recent version of the trunk as of Nov 20, 2007
Reporter: Nathaniel Powell
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: crawl-urlfilter.txt, NUTCH-578.patch, 
 NUTCH-578_v2.patch, NUTCH-578_v3.patch, NUTCH-578_v4.patch, 
 NUTCH-578_v5.patch, nutch-site.xml, regex-normalize.xml, urls.txt


 I have not changed the following parameter in the nutch-default.xml:
 <property>
   <name>db.fetch.retry.max</name>
   <value>3</value>
   <description>The maximum number of times a url that has encountered
   recoverable errors is generated for fetch.</description>
 </property>
 However, there is a URL on the site that I'm crawling, 
 www.teachertube.com, which keeps being generated over and over again for 
 almost every segment (many more than 3 times):
 fetch of http://www.teachertube.com/images/ failed with: Http code=403, 
 url=http://www.teachertube.com/images/
 This is a bug, right?
 Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira