[jira] [Commented] (NUTCH-1296) nutchgora fetcher does not show correct 'threads' and 'resuming' properties
[ https://issues.apache.org/jira/browse/NUTCH-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221508#comment-13221508 ]

Hudson commented on NUTCH-1296:
-------------------------------

Integrated in Nutch-nutchgora #181 (See [https://builds.apache.org/job/Nutch-nutchgora/181/])
NUTCH-1296 nutchgora fetcher does not show correct 'threads' and 'resuming' properties (Revision 1296203)

Result = SUCCESS
ferdy :
Files :
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherJob.java

> nutchgora fetcher does not show correct 'threads' and 'resuming' properties
> ---------------------------------------------------------------------------
>
> Key: NUTCH-1296
> URL: https://issues.apache.org/jira/browse/NUTCH-1296
> Project: Nutch
> Issue Type: Bug
> Reporter: Ferdy Galema
> Priority: Trivial
> Fix For: nutchgora
>
> The nutchgora FetcherJob logs the 'threads' and 'resuming' properties just
> before fetching, but they are read from the config, ignoring the fact that
> they are also specified as parameters. (These parameters are only set on the
> config later.)
> A trivial fix will follow right away.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1295) nutchgora restlet dependencies failing when remote repos is down
[ https://issues.apache.org/jira/browse/NUTCH-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221509#comment-13221509 ]

Hudson commented on NUTCH-1295:
-------------------------------

Integrated in Nutch-nutchgora #181 (See [https://builds.apache.org/job/Nutch-nutchgora/181/])
NUTCH-1295 nutchgora restlet dependencies failing when remote repos is down (Revision 1296114)

Result = SUCCESS
ferdy :
Files :
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/ivy/ivysettings.xml

> nutchgora restlet dependencies failing when remote repos is down
> ----------------------------------------------------------------
>
> Key: NUTCH-1295
> URL: https://issues.apache.org/jira/browse/NUTCH-1295
> Project: Nutch
> Issue Type: Bug
> Reporter: Ferdy Galema
> Attachments: NUTCH-1295.patch
>
> Currently the head of nutchgora cannot be built when running "ant clean
> runtime". This is because the restlet dependencies cannot be found, even
> though there are local restlet copies in the ivy2 cache dir. Did we not
> have this problem before?
> Anyway, I found a solution. Basically I renamed the resolver so that its
> name differs from the chain name. This way the restlet dependencies are read
> from the local cache when the remote one is not available. See patch for
> details.
[jira] [Commented] (NUTCH-1292) Better exception logging and debugging during fetch.
[ https://issues.apache.org/jira/browse/NUTCH-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221506#comment-13221506 ]

Hudson commented on NUTCH-1292:
-------------------------------

Integrated in Nutch-nutchgora #181 (See [https://builds.apache.org/job/Nutch-nutchgora/181/])
NUTCH-1292 Better exception logging and debugging during fetch. (Revision 1296239)

Result = SUCCESS
ferdy :
Files :
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetchEntry.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherReducer.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/scoring/ScoreDatum.java

> Better exception logging and debugging during fetch.
> ----------------------------------------------------
>
> Key: NUTCH-1292
> URL: https://issues.apache.org/jira/browse/NUTCH-1292
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Priority: Trivial
> Fix For: nutchgora
[jira] [Commented] (NUTCH-1263) FetcherJob must put 'fetchTime' on input
[ https://issues.apache.org/jira/browse/NUTCH-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221507#comment-13221507 ]

Hudson commented on NUTCH-1263:
-------------------------------

Integrated in Nutch-nutchgora #181 (See [https://builds.apache.org/job/Nutch-nutchgora/181/])
NUTCH-1263 FetcherJob must put 'fetchTime' on input (Revision 1296236)

Result = SUCCESS
ferdy :
Files :
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherJob.java

> FetcherJob must put 'fetchTime' on input
> ----------------------------------------
>
> Key: NUTCH-1263
> URL: https://issues.apache.org/jira/browse/NUTCH-1263
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: nutchgora
> Reporter: Ferdy Galema
> Priority: Minor
> Fix For: nutchgora
> Attachments: NUTCH-1263.patch
>
> The reducer of the fetcher reads the field fetchTime, but does not include
> it on the input. A trivial patch fixes this.
[jira] [Updated] (NUTCH-475) Adaptive crawl delay
[ https://issues.apache.org/jira/browse/NUTCH-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-475:
--------------------------------------

Attachment: NUTCH-475.patch

Updated patch which brings this issue up to speed as of Dogacan's comments. None of Todd's work was ever uploaded; however, I think we should work towards an implementation as Enis suggested. I suppose we can try/test this implementation, as I have not done so yet.

> Adaptive crawl delay
> --------------------
>
> Key: NUTCH-475
> URL: https://issues.apache.org/jira/browse/NUTCH-475
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Reporter: Doğacan Güney
> Attachments: NUTCH-475.patch, adaptive-delay_draft.patch
>
> The current fetcher implementation waits a default interval before making
> another request to the same server (if crawl-delay is not specified in
> robots.txt). IMHO, an adaptive implementation would be better. If the server
> is under little load and can serve requests fast, then the fetcher can ask
> for more pages in a given interval. Similarly, if the server is suffering
> from heavy load, the fetcher can slow down (w.r.t. that host), easing the
> load on the server.
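The adaptive idea described in NUTCH-475 can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from either attached patch: the class name `AdaptiveDelaySketch`, the method `nextDelayMs`, and the policy itself (aim for a delay of twice the last response time, move halfway toward that target, clamp to configured bounds) are all hypothetical. It also assumes the host gave no crawl-delay in robots.txt, since the issue only covers that case.

```java
// Hypothetical sketch of a per-host adaptive crawl delay (not taken from the NUTCH-475 patches).
final class AdaptiveDelaySketch {

    /**
     * Computes the next politeness delay for one host.
     *
     * @param currentDelayMs delay currently used for this host
     * @param responseMs     how long the last request to this host took
     * @param minMs          lower bound, so a fast server is never hammered
     * @param maxMs          upper bound, so a slow server does not stall the queue forever
     */
    static long nextDelayMs(long currentDelayMs, long responseMs, long minMs, long maxMs) {
        // Assumed policy: a slow response signals load, so the target delay grows with it.
        long target = 2 * responseMs;
        // Move only halfway toward the target to avoid reacting to a single outlier.
        long next = (currentDelayMs + target) / 2;
        // Clamp to the configured bounds.
        return Math.max(minMs, Math.min(maxMs, next));
    }
}
```

With these assumptions, a host that answers in 100 ms while the current delay is 5000 ms would drop to a 2600 ms delay on the next round, and a host that takes 4000 ms while the delay is 1000 ms would be backed off to 4500 ms.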
[jira] [Commented] (NUTCH-1253) Incompatible neko and xerces versions
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220984#comment-13220984 ]

Ferdy Galema commented on NUTCH-1253:
-------------------------------------

I'll give this one a go.

> Incompatible neko and xerces versions
> -------------------------------------
>
> Key: NUTCH-1253
> URL: https://issues.apache.org/jira/browse/NUTCH-1253
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.4
> Environment: Ubuntu 10.04
> Reporter: Dennis Spathis
> Attachments: NUTCH-1253-nutchgora.patch, NUTCH-1253.patch
>
> The Nutch 1.4 distribution includes
> - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
> - xercesImpl-2.9.1.jar (under .../runtime/local/lib)
> These two JARs appear to be incompatible versions. When the HtmlParser
> (configured to use neko) is invoked during a local-mode crawl, the parse
> fails due to an AbstractMethodError. (Note: To see the AbstractMethodError,
> rebuild the HtmlParser plugin and add a catch(Throwable) clause in the
> getParse method to log the stack trace.)
> I found that substituting a later, compatible version of nekohtml (1.9.11)
> fixes the problem.
> Curiously, and in support of the above, the nekohtml plugin.xml file in
> Nutch 1.4 contains the following:
> id="lib-nekohtml"
> name="CyberNeko HTML Parser"
> version="1.9.11"
> provider-name="org.cyberneko">
>
> Note the conflicting version numbers (the version attribute is "1.9.11" but
> the specified library is "nekohtml-0.9.5.jar").
> Was the 0.9.5 version included by mistake? Was the intention rather to
> include 1.9.11?
[jira] [Closed] (NUTCH-1292) Better exception logging and debugging during fetch.
[ https://issues.apache.org/jira/browse/NUTCH-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema closed NUTCH-1292.
------------------------------

Resolution: Fixed

committed

> Better exception logging and debugging during fetch.
> ----------------------------------------------------
>
> Key: NUTCH-1292
> URL: https://issues.apache.org/jira/browse/NUTCH-1292
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Priority: Trivial
> Fix For: nutchgora
[jira] [Closed] (NUTCH-1263) FetcherJob must put 'fetchTime' on input
[ https://issues.apache.org/jira/browse/NUTCH-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema closed NUTCH-1263.
------------------------------

Resolution: Fixed
Fix Version/s: nutchgora

This one slipped under the radar. Committed.

> FetcherJob must put 'fetchTime' on input
> ----------------------------------------
>
> Key: NUTCH-1263
> URL: https://issues.apache.org/jira/browse/NUTCH-1263
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: nutchgora
> Reporter: Ferdy Galema
> Priority: Minor
> Fix For: nutchgora
> Attachments: NUTCH-1263.patch
>
> The reducer of the fetcher reads the field fetchTime, but does not include
> it on the input. A trivial patch fixes this.
Re: Nutch with Letor
Also, please move this discussion to user@, as it seems to be more relevant there.

Thanks

On Fri, Mar 2, 2012 at 2:13 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:

> Hi,
>
> Would be great if you could provide some links to the dataset, exactly
> what it is etc.
>
> Thank you
>
> On Fri, Mar 2, 2012 at 1:19 PM, varunpandeyengg wrote:
>
>> Hey Guys,
>>
>> I am new to Nutch. I am part of an IR research team and need to create a
>> setup in which I crawl Microsoft's LETOR dataset with Nutch. After
>> googling for a while, I didn't find any tutorial or help. Could anyone
>> guide me on this?
>>
>> I am using Nutch 1.4 on Ubuntu 11.10 & Eclipse 3.7.
>>
>> So far I am able to crawl the public network from my Nutch setup
>> integrated with Eclipse...
>>
>> Thanks in advance.
>>
>> -
>> Varun
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Nutch-with-Letor-tp3793432p3793432.html
>> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>
> --
> *Lewis*

--
*Lewis*
Re: Nutch with Letor
Hi,

It would be great if you could provide some links to the dataset, what exactly it is, etc.

Thank you

On Fri, Mar 2, 2012 at 1:19 PM, varunpandeyengg wrote:

> Hey Guys,
>
> I am new to Nutch. I am part of an IR research team and need to create a
> setup in which I crawl Microsoft's LETOR dataset with Nutch. After
> googling for a while, I didn't find any tutorial or help. Could anyone
> guide me on this?
>
> I am using Nutch 1.4 on Ubuntu 11.10 & Eclipse 3.7.
>
> So far I am able to crawl the public network from my Nutch setup
> integrated with Eclipse...
>
> Thanks in advance.
>
> -
> Varun
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-with-Letor-tp3793432p3793432.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.

--
*Lewis*
[jira] [Closed] (NUTCH-1296) nutchgora fetcher does not show correct 'threads' and 'resuming' properties
[ https://issues.apache.org/jira/browse/NUTCH-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema closed NUTCH-1296.
------------------------------

Resolution: Fixed

committed

> nutchgora fetcher does not show correct 'threads' and 'resuming' properties
> ---------------------------------------------------------------------------
>
> Key: NUTCH-1296
> URL: https://issues.apache.org/jira/browse/NUTCH-1296
> Project: Nutch
> Issue Type: Bug
> Reporter: Ferdy Galema
> Priority: Trivial
> Fix For: nutchgora
>
> The nutchgora FetcherJob logs the 'threads' and 'resuming' properties just
> before fetching, but they are read from the config, ignoring the fact that
> they are also specified as parameters. (These parameters are only set on the
> config later.)
> A trivial fix will follow right away.
[jira] [Created] (NUTCH-1296) nutchgora fetcher does not show correct 'threads' and 'resuming' properties
nutchgora fetcher does not show correct 'threads' and 'resuming' properties
---------------------------------------------------------------------------

Key: NUTCH-1296
URL: https://issues.apache.org/jira/browse/NUTCH-1296
Project: Nutch
Issue Type: Bug
Reporter: Ferdy Galema
Priority: Trivial
Fix For: nutchgora

The nutchgora FetcherJob logs the 'threads' and 'resuming' properties just before fetching, but they are read from the config, ignoring the fact that they are also specified as parameters. (These parameters are only set on the config later.)
A trivial fix will follow right away.
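The kind of fix described in NUTCH-1296 might look something like the sketch below. It is an illustration only, not the committed change to FetcherJob.java: a plain `Map` stands in for the Hadoop configuration, and the class `FetcherLogSketch`, the method `describe`, the property keys, and the defaults (10 threads, no resume) are assumptions made for the example. The point is simply that the effective values are resolved from the parameters first, written back to the config, and only then logged.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: resolve the effective values before logging them.
final class FetcherLogSketch {

    // Assumed property keys; the real FetcherJob constants may differ.
    static final String THREADS_KEY = "fetcher.threads.fetch";
    static final String RESUME_KEY = "fetcher.job.resume";

    /**
     * Resolves 'threads' and 'resuming' from the explicit parameters, falling back
     * to the config only when a parameter is absent (null), then stores the result
     * on the config and returns the text that would be logged.
     */
    static String describe(Map<String, String> conf, Integer threadsArg, Boolean resumeArg) {
        int threads = (threadsArg != null)
                ? threadsArg
                : Integer.parseInt(conf.getOrDefault(THREADS_KEY, "10"));
        boolean resume = (resumeArg != null)
                ? resumeArg
                : Boolean.parseBoolean(conf.getOrDefault(RESUME_KEY, "false"));

        // Set the parameters on the config first, so the log and later stages agree.
        conf.put(THREADS_KEY, Integer.toString(threads));
        conf.put(RESUME_KEY, Boolean.toString(resume));

        return "threads=" + threads + " resuming=" + resume;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(THREADS_KEY, "10");
        // The caller asked for 50 threads and a resumed run; the log line reflects that.
        System.out.println(describe(conf, 50, true));
    }
}
```

Reading the values from the config before the parameters have been applied, as the issue describes, would instead report the stale config value of 10 threads.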
Nutch with Letor
Hey Guys,

I am new to Nutch. I am part of an IR research team and need to create a setup in which I crawl Microsoft's LETOR dataset with Nutch. After googling for a while, I didn't find any tutorial or help. Could anyone guide me on this?

I am using Nutch 1.4 on Ubuntu 11.10 & Eclipse 3.7.

So far I am able to crawl the public network from my Nutch setup integrated with Eclipse...

Thanks in advance.

-
Varun

--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-with-Letor-tp3793432p3793432.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
Re: Drawing an analogy between AdaptiveFetchSchedule and AdaptiveCrawlDelay
Hi Andrzej,

On Fri, Mar 2, 2012 at 12:37 PM, Andrzej Bialecki wrote:

> Fetcher2 is the current Fetcher. The original Fetcher was temporarily
> renamed OldFetcher and then removed.

So it looks like this 'might' be more straightforward to implement than I originally thought. When I get a bit of time I would like to dive into it.

Thanks
[jira] [Updated] (NUTCH-1273) Fix [deprecation] javac warnings
[ https://issues.apache.org/jira/browse/NUTCH-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1273:
---------------------------------------

Attachment: NUTCH-1273-v2-trunk.patch

This patch goes some way towards addressing the issues described on the user or dev list. I'm having some problems with Exceptions, and tbh I'm not really sure about the new API construction. I opted to switch the MimeUtil#autoResolveContentType code to use a mimetype String, as opposed to either of the following:
* Switch the code to use MediaType rather than MimeType, and call DefaultDetector directly (rather than using the Tika facade class)
* If we get back a String (not null) for the mimetype, create a MimeType object for it.

In all honesty, if the method I have used is not suitable, then I think the latter of the above alternatives would be better, simply because we are not currently calling MediaType anywhere. I've been trying to stay consistent when working on this one. If someone could have a look it would be greatly appreciated.

Thanks

> Fix [deprecation] javac warnings
> --------------------------------
>
> Key: NUTCH-1273
> URL: https://issues.apache.org/jira/browse/NUTCH-1273
> Project: Nutch
> Issue Type: Sub-task
> Components: build
> Affects Versions: nutchgora, 1.5
> Reporter: Lewis John McGibbney
> Priority: Minor
> Fix For: nutchgora, 1.5
> Attachments: NUTCH-1273-nutchgora.patch, NUTCH-1273-trunk.patch,
> NUTCH-1273-v2-trunk.patch
>
> As part of this task, these warnings should be resolved. However, this
> particular strand of warnings can be resolved either by adding
> {code}
> @SuppressWarnings("deprecation")
> {code}
> or by actually upgrading our class usage to rely upon non-deprecated classes.
> Which option is more appropriate for the project?
Re: Drawing an analogy between AdaptiveFetchSchedule and AdaptiveCrawlDelay
On 02/03/2012 12:45, Lewis John Mcgibbney wrote:

> Hi Guys,
>
> As there were some comments on the user list, I recently got digging with
> http redirects then stumbled across NUTCH-1042. Although these are
> individual issues e.g. redirects and crawl delays, I think they are
> certainly linked, however what is interesting is that users 'usually'
> don't consider them to be interlinked as such and therefore struggle to
> debug how and why either the redirect or the crawl delay pages are not
> being fetched. Doing some more digging I found the now rather old and
> tatty NUTCH-475, which obviously got me thinking about how we maintain the
> AdaptiveFetchSchedule for custom refetching.
>
> Now I begin to start thinking about the following
>
> - Regardless of whether we implement an AdaptiveCrawlDelay, NUTCH-1042
> still needs fixed as this is obviously becoming a bit of a pain for some
> users.

Yes.

> - Can someone shine some light on what happened to Fetcher2.java that
> Dogacan refers to? I was only ever accustomed to OldFetcher and Fetcher :0)

Fetcher2 is the current Fetcher. The original Fetcher was temporarily renamed OldFetcher and then removed.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Drawing an analogy between AdaptiveFetchSchedule and AdaptiveCrawlDelay
Hi Guys,

As there were some comments on the user list, I recently started digging into http redirects and then stumbled across NUTCH-1042. Although redirects and crawl delays are individual issues, I think they are certainly linked. What is interesting is that users 'usually' don't consider them to be interlinked, and therefore struggle to debug how and why either the redirected pages or the pages subject to a crawl delay are not being fetched. Doing some more digging I found the now rather old and tatty NUTCH-475, which obviously got me thinking about how we maintain the AdaptiveFetchSchedule for custom refetching.

Now I am starting to think about the following:

- Regardless of whether we implement an AdaptiveCrawlDelay, NUTCH-1042 still needs to be fixed, as this is obviously becoming a bit of a pain for some users.
- Can someone shine some light on what happened to Fetcher2.java that Dogacan refers to? I was only ever accustomed to OldFetcher and Fetcher :0)
- For you guys managing/running/maintaining your own (and possibly clients') web servers, what are your perceptions of maintaining your own AdaptiveCrawlDelay? Pros and cons (apart from the obvious)?

I can't really think of anything else at the moment!

Thanks

Lewis

--
*Lewis*
[jira] [Closed] (NUTCH-1295) nutchgora restlet dependencies failing when remote repos is down
[ https://issues.apache.org/jira/browse/NUTCH-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema closed NUTCH-1295.
------------------------------

Resolution: Fixed

committed

> nutchgora restlet dependencies failing when remote repos is down
> ----------------------------------------------------------------
>
> Key: NUTCH-1295
> URL: https://issues.apache.org/jira/browse/NUTCH-1295
> Project: Nutch
> Issue Type: Bug
> Reporter: Ferdy Galema
> Attachments: NUTCH-1295.patch
>
> Currently the head of nutchgora cannot be built when running "ant clean
> runtime". This is because the restlet dependencies cannot be found, even
> though there are local restlet copies in the ivy2 cache dir. Did we not
> have this problem before?
> Anyway, I found a solution. Basically I renamed the resolver so that its
> name differs from the chain name. This way the restlet dependencies are read
> from the local cache when the remote one is not available. See patch for
> details.
[jira] [Updated] (NUTCH-1295) nutchgora restlet dependencies failing when remote repos is down
[ https://issues.apache.org/jira/browse/NUTCH-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-1295:
-------------------------------

Attachment: NUTCH-1295.patch

> nutchgora restlet dependencies failing when remote repos is down
> ----------------------------------------------------------------
>
> Key: NUTCH-1295
> URL: https://issues.apache.org/jira/browse/NUTCH-1295
> Project: Nutch
> Issue Type: Bug
> Reporter: Ferdy Galema
> Attachments: NUTCH-1295.patch
>
> Currently the head of nutchgora cannot be built when running "ant clean
> runtime". This is because the restlet dependencies cannot be found, even
> though there are local restlet copies in the ivy2 cache dir. Did we not
> have this problem before?
> Anyway, I found a solution. Basically I renamed the resolver so that its
> name differs from the chain name. This way the restlet dependencies are read
> from the local cache when the remote one is not available. See patch for
> details.
[jira] [Created] (NUTCH-1295) nutchgora restlet dependencies failing when remote repos is down
nutchgora restlet dependencies failing when remote repos is down
----------------------------------------------------------------

Key: NUTCH-1295
URL: https://issues.apache.org/jira/browse/NUTCH-1295
Project: Nutch
Issue Type: Bug
Reporter: Ferdy Galema

Currently the head of nutchgora cannot be built when running "ant clean runtime". This is because the restlet dependencies cannot be found, even though there are local restlet copies in the ivy2 cache dir. Did we not have this problem before?

Anyway, I found a solution. Basically I renamed the resolver so that its name differs from the chain name. This way the restlet dependencies are read from the local cache when the remote one is not available. See patch for details.
[jira] [Issue Comment Edited] (NUTCH-1024) Dynamically set fetchInterval by MIME-type
[ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220787#comment-13220787 ]

Markus Jelsma edited comment on NUTCH-1024 at 3/2/12 9:05 AM:
--------------------------------------------------------------

New patch for trunk! This also includes a change to the injector where injected fetchInterval is added to CrawlDatum MD. In AdaptiveFetchSchedule this injected interval overrides anything else. This is useful for sites where you want to use AdaptiveFetchSchedule but still want the generator to select an injected homepage every N hours.

was (Author: markus17):
New patch for trunk! This also includes a change to the injector where injected fetchInterval is added to CrawlDatum MD. In AdaptiveFetchSchedule this injected interval overrides anything else.

> Dynamically set fetchInterval by MIME-type
> ------------------------------------------
>
> Key: NUTCH-1024
> URL: https://issues.apache.org/jira/browse/NUTCH-1024
> Project: Nutch
> Issue Type: New Feature
> Components: generator
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.5
> Attachments: AdaptiveFetchSchedule.patch,
> MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, Nutch.patch,
> adaptive-mimetypes.txt
>
> Add facility to configure default or fixed fetchInterval values by MIME-type.
> This is useful for conserving resources for files that are known to change
> frequently or never and everything in between.
> * simple key\tvalue\n configuration file
> * only set fetchInterval for new documents
> * keep max fetchInterval fixed by current config
[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type
[ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1024:
--------------------------------

Attachment: NUTCH-1024-1.5-1.patch

New patch for trunk! This also includes a change to the injector where injected fetchInterval is added to CrawlDatum MD. In AdaptiveFetchSchedule this injected interval overrides anything else.

> Dynamically set fetchInterval by MIME-type
> ------------------------------------------
>
> Key: NUTCH-1024
> URL: https://issues.apache.org/jira/browse/NUTCH-1024
> Project: Nutch
> Issue Type: New Feature
> Components: generator
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.5
> Attachments: AdaptiveFetchSchedule.patch,
> MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, Nutch.patch,
> adaptive-mimetypes.txt
>
> Add facility to configure default or fixed fetchInterval values by MIME-type.
> This is useful for conserving resources for files that are known to change
> frequently or never and everything in between.
> * simple key\tvalue\n configuration file
> * only set fetchInterval for new documents
> * keep max fetchInterval fixed by current config
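The three bullet points in the NUTCH-1024 description could be sketched along the following lines. This is a rough illustration, not the attached MimeAdaptiveFetchSchedule.java: the class `MimeIntervalSketch`, its method names, the use of seconds as the unit, and the handling of `#` comment lines are all assumptions for the example. It reads a `key\tvalue\n` file body into a map and picks the initial interval for a new document, capped at the configured maximum.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of MIME-type based initial fetch intervals (not the attached patch).
final class MimeIntervalSketch {

    /** Parses "mime-type<TAB>interval" lines; blank, '#' and malformed lines are skipped. */
    static Map<String, Integer> parse(String text) {
        Map<String, Integer> intervals = new HashMap<>();
        for (String raw : text.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // assumed comment convention, not stated in the issue
            }
            String[] parts = line.split("\t");
            if (parts.length != 2) {
                continue; // ignore lines that are not exactly key<TAB>value
            }
            intervals.put(parts[0].trim(), Integer.parseInt(parts[1].trim()));
        }
        return intervals;
    }

    /**
     * Interval (assumed to be in seconds) for a document seen for the first time:
     * the per-MIME-type value if one is configured, else the default, never above the maximum.
     */
    static int initialInterval(Map<String, Integer> byMime, String mimeType,
                               int defaultInterval, int maxInterval) {
        int interval = byMime.getOrDefault(mimeType, defaultInterval);
        return Math.min(interval, maxInterval);
    }
}
```

For example, with a file containing `text/html\t3600` and `application/pdf\t2592000`, a default of 86400 and a maximum of 604800, a new PDF would start at 604800 (the cap), a new HTML page at 3600, and an unlisted type such as image/png at the default of 86400. Documents that already have a fetchInterval would be left alone, matching the "only set fetchInterval for new documents" bullet.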