[jira] [Commented] (NUTCH-3011) HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)

2023-10-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770856#comment-17770856
 ] 

ASF GitHub Bot commented on NUTCH-3011:
---

sebastian-nagel opened a new pull request, #786:
URL: https://github.com/apache/nutch/pull/786

   (no comment)




> HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors 
> (HTTP 5xx)
> 
>
> Key: NUTCH-3011
> URL: https://issues.apache.org/jira/browse/NUTCH-3011
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.20
>
>
> HttpRobotRulesParser should handle HTTP 429 Too Many Requests the same way as 
> server errors (HTTP 5xx), that is, if configured, signal the Fetcher to delay 
> requests. See also NUTCH-2573 and 
> https://support.google.com/webmasters/answer/9679690#robots_details



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [nutch] sebastian-nagel opened a new pull request, #786: NUTCH-3011 HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)

2023-10-01 Thread via GitHub


sebastian-nagel opened a new pull request, #786:
URL: https://github.com/apache/nutch/pull/786

   (no comment)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (NUTCH-3011) HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)

2023-10-01 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3011:
--

Summary: HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)
Key: NUTCH-3011
URL: https://issues.apache.org/jira/browse/NUTCH-3011
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.19
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
Fix For: 1.20


HttpRobotRulesParser should handle HTTP 429 Too Many Requests the same way as 
server errors (HTTP 5xx), that is, if configured, signal the Fetcher to delay 
requests. See also NUTCH-2573 and 
https://support.google.com/webmasters/answer/9679690#robots_details
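
The behaviour requested above could be sketched roughly as follows. This is
only an illustration, not the code from the pull request: the class, enum and
method names are hypothetical, the boolean parameter stands in for whatever
configuration switch enables deferring, and every status other than 200, 429
and 5xx is simplified to "allow all".

```java
public class RobotsStatusSketch {

  /** Hypothetical outcomes of a robots.txt fetch; not Nutch's actual types. */
  enum Action { PARSE_RULES, ALLOW_ALL, DEFER_VISITS }

  /** True for HTTP 429 Too Many Requests and for any 5xx server error. */
  static boolean isTemporaryFailure(int status) {
    return status == 429 || (status >= 500 && status <= 599);
  }

  /**
   * Decide what to do with a robots.txt response.
   *
   * @param status         HTTP status code of the robots.txt response
   * @param deferOnFailure assumed configuration flag: if true, temporary
   *                       failures tell the fetcher to delay requests
   */
  static Action decide(int status, boolean deferOnFailure) {
    if (status == 200) {
      return Action.PARSE_RULES;
    }
    if (deferOnFailure && isTemporaryFailure(status)) {
      // 429 takes the same branch as 5xx: back off instead of fetching.
      return Action.DEFER_VISITS;
    }
    // Simplification: all remaining cases are treated as "no restrictions".
    return Action.ALLOW_ALL;
  }

  public static void main(String[] args) {
    for (int status : new int[] {200, 404, 429, 503}) {
      System.out.println(status + " -> " + decide(status, true));
    }
  }
}
```

The point of the sketch is that 429 and 5xx share one predicate, so the
"if configured" switch applies to both in exactly the same way.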





[jira] [Closed] (NUTCH-1373) Implement consistent execution of normalising and filtering in Generator

2023-10-01 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-1373.
--

> Implement consistent execution of normalising and filtering in Generator
> 
>
> Key: NUTCH-1373
> URL: https://issues.apache.org/jira/browse/NUTCH-1373
> Project: Nutch
> Issue Type: Improvement
> Components: generator
> Affects Versions: 1.4
> Reporter: Lewis John McGibbney
> Priority: Minor
>
> As per discussion here [0] this issue should address the inconsistencies we 
> see in the scheduled execution of normalising and filtering between Nutchgora 
> Generator Mapper and trunk Generator mapper/reducer.
> Hopefully we can come to some consensus as to the best approach across both 
> dists. 
> [0] http://www.mail-archive.com/user%40nutch.apache.org/msg06360.html





[jira] [Resolved] (NUTCH-1373) Implement consistent execution of normalising and filtering in Generator

2023-10-01 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1373.

Resolution: Abandoned

Closing because Nutch 2.x (aka nutchgora) is no longer maintained.

> Implement consistent execution of normalising and filtering in Generator
> 
>
> Key: NUTCH-1373
> URL: https://issues.apache.org/jira/browse/NUTCH-1373
> Project: Nutch
> Issue Type: Improvement
> Components: generator
> Affects Versions: 1.4
> Reporter: Lewis John McGibbney
> Priority: Minor
>
> As per discussion here [0] this issue should address the inconsistencies we 
> see in the scheduled execution of normalising and filtering between Nutchgora 
> Generator Mapper and trunk Generator mapper/reducer.
> Hopefully we can come to some consensus as to the best approach across both 
> dists. 
> [0] http://www.mail-archive.com/user%40nutch.apache.org/msg06360.html





[jira] [Commented] (NUTCH-1374) Workaround for license headers

2023-10-01 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770833#comment-17770833
 ] 

Sebastian Nagel commented on NUTCH-1374:


The package.html files were replaced in NUTCH-2849 by package-info.java files 
that contain a license header.

> Workaround for license headers
> --
>
> Key: NUTCH-1374
> URL: https://issues.apache.org/jira/browse/NUTCH-1374
> Project: Nutch
> Issue Type: Task
> Components: documentation
> Affects Versions: 1.4, nutchgora
> Reporter: Lewis John McGibbney
> Priority: Major
>
> Currently in both versions of Nutch we have two types of files which DO NOT 
> contain license headers; namely all package.html files and the test files 
> within the language detection plugin. On my initial tests, adding license 
> headers to the language test files breaks the tests so we need to find a 
> workaround (or the correct syntax) to add commented out license headers to 
> these files.





[jira] [Commented] (NUTCH-1635) New crawldb sometimes ends up in current

2023-10-01 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770831#comment-17770831
 ] 

Sebastian Nagel commented on NUTCH-1635:


Hi [~markus17], has this continued to happen in recent years, especially after 
upgrading the MapReduce API (NUTCH-2375)?

> New crawldb sometimes ends up in current
> 
>
> Key: NUTCH-1635
> URL: https://issues.apache.org/jira/browse/NUTCH-1635
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.7
> Reporter: Markus Jelsma
> Priority: Major
>
> In some weird cases the newly created crawldb by updatedb ends up in 
> crawl/crawldb/current//. So instead of replacing current/, it ends up 
> inside current/! This causes the generator to fail.
> It's impossible to reliably reproduce the problem. It only happened a couple 
> of times in the last few years.





[jira] [Resolved] (NUTCH-1947) Overhaul o.a.n.parse.OutlinkExtractor.java

2023-10-01 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1947.

Resolution: Abandoned

Closing because OutlinkExtractor has seen many updates since then: upgrade to 
Java 8, replacement of Apache ORO with java.util.regex, etc.

> Overhaul o.a.n.parse.OutlinkExtractor.java 
> ---
>
> Key: NUTCH-1947
> URL: https://issues.apache.org/jira/browse/NUTCH-1947
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 2.3, 1.9
> Reporter: Lewis John McGibbney
> Priority: Major
>
> Right now in both trunk and 2.X, the 
> [OutlinkExtractor.java|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java]
>  class needs a bit of TLC. It references JDK 1.5 in a few places, there are 
> misleading URL entries and it boasts some interesting @Deprecated methods 
> which we could ideally remove.





[jira] [Closed] (NUTCH-1947) Overhaul o.a.n.parse.OutlinkExtractor.java

2023-10-01 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-1947.
--

> Overhaul o.a.n.parse.OutlinkExtractor.java 
> ---
>
> Key: NUTCH-1947
> URL: https://issues.apache.org/jira/browse/NUTCH-1947
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 2.3, 1.9
> Reporter: Lewis John McGibbney
> Priority: Major
>
> Right now in both trunk and 2.X, the 
> [OutlinkExtractor.java|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java]
>  class needs a bit of TLC. It references JDK 1.5 in a few places, there are 
> misleading URL entries and it boasts some interesting @Deprecated methods 
> which we could ideally remove.





[jira] [Resolved] (NUTCH-2053) Uncessary dependencies included in ivy.xml (post NUTCH-2038)

2023-10-01 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2053.

Resolution: Abandoned

Closing this old issue (8 years), assuming that dependencies have been updated 
and cleaned up multiple times since then.

> Uncessary dependencies included in ivy.xml (post NUTCH-2038)
> 
>
> Key: NUTCH-2053
> URL: https://issues.apache.org/jira/browse/NUTCH-2053
> Project: Nutch
> Issue Type: Bug
> Components: build
> Affects Versions: 1.11
> Reporter: Lewis John McGibbney
> Priority: Major
>
> Currently in trunk we have an unnecessary dependency included within 
> ivy/ivy.xml
> https://github.com/apache/nutch/blob/trunk/ivy/ivy.xml#L99-L101
> This needs to be removed.
> [~asitang] can you please provide context as to why this is OK? I don't want 
> to break your code, so sorry for my lack of understanding. Thanks





[jira] [Closed] (NUTCH-2053) Uncessary dependencies included in ivy.xml (post NUTCH-2038)

2023-10-01 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-2053.
--

> Uncessary dependencies included in ivy.xml (post NUTCH-2038)
> 
>
> Key: NUTCH-2053
> URL: https://issues.apache.org/jira/browse/NUTCH-2053
> Project: Nutch
> Issue Type: Bug
> Components: build
> Affects Versions: 1.11
> Reporter: Lewis John McGibbney
> Priority: Major
>
> Currently in trunk we have an unnecessary dependency included within 
> ivy/ivy.xml
> https://github.com/apache/nutch/blob/trunk/ivy/ivy.xml#L99-L101
> This needs to be removed.
> [~asitang] can you please provide context as to why this is OK? I don't want 
> to break your code, so sorry for my lack of understanding. Thanks





[jira] [Resolved] (NUTCH-2423) Update contributor info page

2023-10-01 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2423.

Fix Version/s: (was: 1.20)
   Resolution: Fixed

The wiki pages were updated in 2020 and 2021. Thanks for reporting, 
[~krichter]!

> Update contributor info page
> 
>
> Key: NUTCH-2423
> URL: https://issues.apache.org/jira/browse/NUTCH-2423
> Project: Nutch
> Issue Type: Task
> Components: documentation, wiki
> Reporter: Karl-Philipp Richter
> Priority: Major
> Labels: easytask, help-wanted
>
> The [contributor info 
> page](https://wiki.apache.org/nutch/Becoming_A_Nutch_Developer) still 
> mentions subversion as SCM which I assume is obsolete because there's 
> git://git.apache.org/nutch.git. It should mention how the devs with write 
> access deal with pull/merge requests in general or on different popular 
> platforms (the information that they're not accepted is valuable as well).


