[jira] [Commented] (NUTCH-3011) HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)
[ https://issues.apache.org/jira/browse/NUTCH-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770856#comment-17770856 ]
ASF GitHub Bot commented on NUTCH-3011:
sebastian-nagel opened a new pull request, #786:
URL: https://github.com/apache/nutch/pull/786
(no comment)
> HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors
> (HTTP 5xx)
>
> Key: NUTCH-3011
> URL: https://issues.apache.org/jira/browse/NUTCH-3011
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.20
>
> HttpRobotRulesParser should handle HTTP 429 Too Many Requests the same as server
> errors (HTTP 5xx); that is, if configured, signal the Fetcher to delay requests.
> See also NUTCH-2573 and
> https://support.google.com/webmasters/answer/9679690#robots_details
--
This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [nutch] sebastian-nagel opened a new pull request, #786: NUTCH-3011 HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)
sebastian-nagel opened a new pull request, #786:
URL: https://github.com/apache/nutch/pull/786
(no comment)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (NUTCH-3011) HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)
Sebastian Nagel created NUTCH-3011:
Summary: HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)
Key: NUTCH-3011
URL: https://issues.apache.org/jira/browse/NUTCH-3011
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.19
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
Fix For: 1.20
HttpRobotRulesParser should handle HTTP 429 Too Many Requests the same as server errors (HTTP 5xx); that is, if configured, signal the Fetcher to delay requests.
See also NUTCH-2573 and https://support.google.com/webmasters/answer/9679690#robots_details
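The requested behavior can be illustrated with a minimal sketch. It is not the code from the pull request: the class `RobotsStatusSketch`, the `Decision` enum, and the `deferOnFailure` flag are hypothetical names chosen for illustration, and the allow-all fallback for every other status code is a simplifying assumption rather than a statement about what Nutch does.

```java
/**
 * Illustrative sketch only (hypothetical names, not Nutch code):
 * classify the HTTP status of a robots.txt request so that
 * 429 Too Many Requests is treated like a 5xx server error.
 */
public class RobotsStatusSketch {

    /** Possible outcomes after requesting robots.txt (hypothetical). */
    public enum Decision { PARSE_RULES, DEFER_VISITS, ALLOW_ALL }

    /** True for HTTP 429 and for any 5xx server error. */
    static boolean isTemporaryFailure(int status) {
        return status == 429 || (status >= 500 && status <= 599);
    }

    /**
     * @param status         HTTP status code of the robots.txt response
     * @param deferOnFailure assumed configuration switch: whether the
     *                       fetcher should be told to delay requests
     */
    public static Decision decide(int status, boolean deferOnFailure) {
        if (status >= 200 && status <= 299) {
            // robots.txt was delivered: parse its rules
            return Decision.PARSE_RULES;
        }
        if (deferOnFailure && isTemporaryFailure(status)) {
            // 429 now shares the branch previously reserved for 5xx
            return Decision.DEFER_VISITS;
        }
        // Assumption: every other case falls back to allow-all;
        // redirects and 403 handling are deliberately left out.
        return Decision.ALLOW_ALL;
    }

    public static void main(String[] args) {
        for (int status : new int[] {200, 404, 429, 503}) {
            System.out.println(status + " -> " + decide(status, true));
        }
    }
}
```

Keeping 429 and 5xx in a single predicate means a later change to the deferral policy applies to both cases at once.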
[jira] [Closed] (NUTCH-1373) Implement consistent execution of normalising and filtering in Generator
[ https://issues.apache.org/jira/browse/NUTCH-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel closed NUTCH-1373.
> Implement consistent execution of normalising and filtering in Generator
>
> Key: NUTCH-1373
> URL: https://issues.apache.org/jira/browse/NUTCH-1373
> Project: Nutch
> Issue Type: Improvement
> Components: generator
> Affects Versions: 1.4
> Reporter: Lewis John McGibbney
> Priority: Minor
>
> As per discussion here [0] this issue should address the inconsistencies we
> see in the scheduled execution of normalising and filtering between Nutchgora
> Generator Mapper and trunk Generator mapper/reducer.
> Hopefully we can come to some consensus as to the best approach across both
> dists.
> [0] http://www.mail-archive.com/user%40nutch.apache.org/msg06360.html
[jira] [Resolved] (NUTCH-1373) Implement consistent execution of normalising and filtering in Generator
[ https://issues.apache.org/jira/browse/NUTCH-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel resolved NUTCH-1373.
Resolution: Abandoned
Closing as Nutch 2.x (aka nutchgora) isn't maintained anymore.
> Implement consistent execution of normalising and filtering in Generator
>
> Key: NUTCH-1373
> URL: https://issues.apache.org/jira/browse/NUTCH-1373
> Project: Nutch
> Issue Type: Improvement
> Components: generator
> Affects Versions: 1.4
> Reporter: Lewis John McGibbney
> Priority: Minor
>
> As per discussion here [0] this issue should address the inconsistencies we
> see in the scheduled execution of normalising and filtering between Nutchgora
> Generator Mapper and trunk Generator mapper/reducer.
> Hopefully we can come to some consensus as to the best approach across both
> dists.
> [0] http://www.mail-archive.com/user%40nutch.apache.org/msg06360.html
[jira] [Commented] (NUTCH-1374) Workaround for license headers
[ https://issues.apache.org/jira/browse/NUTCH-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770833#comment-17770833 ]
Sebastian Nagel commented on NUTCH-1374:
The package.html files were replaced by package-info.java files containing a license header in NUTCH-2849.
> Workaround for license headers
>
> Key: NUTCH-1374
> URL: https://issues.apache.org/jira/browse/NUTCH-1374
> Project: Nutch
> Issue Type: Task
> Components: documentation
> Affects Versions: 1.4, nutchgora
> Reporter: Lewis John McGibbney
> Priority: Major
>
> Currently in both versions of Nutch we have two types of files which DO NOT
> contain license headers; namely all package.html files and the test files
> within the language detection plugin. On my initial tests, adding license
> headers to the language test files breaks the tests so we need to find a
> workaround (or the correct syntax) to add commented out license headers to
> these files.
[jira] [Commented] (NUTCH-1635) New crawldb sometimes ends up in current
[ https://issues.apache.org/jira/browse/NUTCH-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770831#comment-17770831 ]
Sebastian Nagel commented on NUTCH-1635:
Hi [~markus17], has this continued to happen in recent years, especially after the upgrade of the MapReduce API (NUTCH-2375)?
> New crawldb sometimes ends up in current
>
> Key: NUTCH-1635
> URL: https://issues.apache.org/jira/browse/NUTCH-1635
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.7
> Reporter: Markus Jelsma
> Priority: Major
>
> In some weird cases the newly created crawldb by updatedb ends up in
> crawl/crawldb/current//. So instead of replacing current/, it ends up
> inside current/! This causes the generator to fail.
> It's impossible to reliably reproduce the problem. It only happened a couple
> of times in the last few years.
[jira] [Resolved] (NUTCH-1947) Overhaul o.a.n.parse.OutlinkExtractor.java
[ https://issues.apache.org/jira/browse/NUTCH-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel resolved NUTCH-1947.
Resolution: Abandoned
Closing because OutlinkExtractor has seen many updates since then: upgrade to Java 8, replacement of Apache ORO with java.util.regex, etc.
> Overhaul o.a.n.parse.OutlinkExtractor.java
>
> Key: NUTCH-1947
> URL: https://issues.apache.org/jira/browse/NUTCH-1947
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 2.3, 1.9
> Reporter: Lewis John McGibbney
> Priority: Major
>
> Right now in both trunk and 2.X, the
> [OutlinkExtractor.java|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java]
> class needs a bit of TLC. It is referencing JDK1.5 in a few places, there are
> misleading URL entries and it boasts some interesting @Deprecated methods
> which we could ideally remove.
[jira] [Closed] (NUTCH-1947) Overhaul o.a.n.parse.OutlinkExtractor.java
[ https://issues.apache.org/jira/browse/NUTCH-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel closed NUTCH-1947.
> Overhaul o.a.n.parse.OutlinkExtractor.java
>
> Key: NUTCH-1947
> URL: https://issues.apache.org/jira/browse/NUTCH-1947
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 2.3, 1.9
> Reporter: Lewis John McGibbney
> Priority: Major
>
> Right now in both trunk and 2.X, the
> [OutlinkExtractor.java|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java]
> class needs a bit of TLC. It is referencing JDK1.5 in a few places, there are
> misleading URL entries and it boasts some interesting @Deprecated methods
> which we could ideally remove.
[jira] [Resolved] (NUTCH-2053) Uncessary dependencies included in ivy.xml (post NUTCH-2038)
[ https://issues.apache.org/jira/browse/NUTCH-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel resolved NUTCH-2053.
Resolution: Abandoned
Closing this old issue (8 years), assuming that dependencies have been updated and cleaned up multiple times since then.
> Uncessary dependencies included in ivy.xml (post NUTCH-2038)
>
> Key: NUTCH-2053
> URL: https://issues.apache.org/jira/browse/NUTCH-2053
> Project: Nutch
> Issue Type: Bug
> Components: build
> Affects Versions: 1.11
> Reporter: Lewis John McGibbney
> Priority: Major
>
> Currently in trunk we have an unnecessary dependency included within
> ivy/ivy.xml
> https://github.com/apache/nutch/blob/trunk/ivy/ivy.xml#L99-L101
> This needs to be removed.
> [~asitang] can you please provide context as to why this is OK? I don't want
> to break your code so sorry for lack of understanding. Thanks
[jira] [Closed] (NUTCH-2053) Uncessary dependencies included in ivy.xml (post NUTCH-2038)
[ https://issues.apache.org/jira/browse/NUTCH-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel closed NUTCH-2053.
> Uncessary dependencies included in ivy.xml (post NUTCH-2038)
>
> Key: NUTCH-2053
> URL: https://issues.apache.org/jira/browse/NUTCH-2053
> Project: Nutch
> Issue Type: Bug
> Components: build
> Affects Versions: 1.11
> Reporter: Lewis John McGibbney
> Priority: Major
>
> Currently in trunk we have an unnecessary dependency included within
> ivy/ivy.xml
> https://github.com/apache/nutch/blob/trunk/ivy/ivy.xml#L99-L101
> This needs to be removed.
> [~asitang] can you please provide context as to why this is OK? I don't want
> to break your code so sorry for lack of understanding. Thanks
[jira] [Resolved] (NUTCH-2423) Update contributor info page
[ https://issues.apache.org/jira/browse/NUTCH-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel resolved NUTCH-2423.
Fix Version/s: (was: 1.20)
Resolution: Fixed
The wiki pages were updated in 2020 and 2021. Thanks for reporting, [~krichter]!
> Update contributor info page
>
> Key: NUTCH-2423
> URL: https://issues.apache.org/jira/browse/NUTCH-2423
> Project: Nutch
> Issue Type: Task
> Components: documentation, wiki
> Reporter: Karl-Philipp Richter
> Priority: Major
> Labels: easytask, help-wanted
>
> The [contributor info
> page](https://wiki.apache.org/nutch/Becoming_A_Nutch_Developer) still
> mentions Subversion as SCM, which I assume is obsolete because there's
> git://git.apache.org/nutch.git. It should mention how the devs with write
> access deal with pull/merge requests in general or on different popular
> platforms (the information that they're not accepted is valuable as well).