Re: [PR] NUTCH-3041 Address confusing logging in o.a.n.net.URLExemptionFilters [nutch]

2024-05-15 Thread via GitHub
lewismc merged PR #813: URL: https://github.com/apache/nutch/pull/813 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

Re: [PR] Revert incorrect change [nutch-site]

2024-05-15 Thread via GitHub
lewismc commented on PR #2: URL: https://github.com/apache/nutch-site/pull/2#issuecomment-2112989006 Yes thank you @sebbASF

Re: [PR] NUTCH-3043 Generator: count URLs rejected by URL filters [nutch]

2024-05-14 Thread via GitHub
sebastian-nagel commented on PR #814: URL: https://github.com/apache/nutch/pull/814#issuecomment-2110558876 Thanks, @lewismc! The metrics wiki page was updated.

Re: [PR] NUTCH-3043 Generator: count URLs rejected by URL filters [nutch]

2024-05-14 Thread via GitHub
sebastian-nagel merged PR #814: URL: https://github.com/apache/nutch/pull/814

Re: [PR] NUTCH-3039 Failure to handle ftp:// URLs [nutch]

2024-05-14 Thread via GitHub
sebastian-nagel merged PR #812: URL: https://github.com/apache/nutch/pull/812

Re: [PR] Revert incorrect change [nutch-site]

2024-05-11 Thread via GitHub
sebastian-nagel commented on PR #2: URL: https://github.com/apache/nutch-site/pull/2#issuecomment-2105982524 Thanks, @sebbASF!

Re: [PR] Revert incorrect change [nutch-site]

2024-05-11 Thread via GitHub
sebastian-nagel merged PR #2: URL: https://github.com/apache/nutch-site/pull/2

[PR] Revert incorrect change [nutch-site]

2024-05-07 Thread via GitHub
sebbASF opened a new pull request, #2: URL: https://github.com/apache/nutch-site/pull/2 Nutch is currently not listed under the web-framework category on projects.apache.org

Re: [PR] NUTCH-3054 Address deprecation of Node16 for all GitHub Actions [nutch]

2024-04-30 Thread via GitHub
lewismc merged PR #817: URL: https://github.com/apache/nutch/pull/817

[PR] NUTCH-1806 Delegate processing of URL domains to crawler-commons [nutch]

2024-04-29 Thread via GitHub
sebastian-nagel opened a new pull request, #816: URL: https://github.com/apache/nutch/pull/816 and NUTCH-1942 Remove TopLevelDomain - use methods from crawler-commons' EffectiveTldFinder in URLUtil replacing classes and methods from the "org.apache.nutch.util.domain" package

Re: [PR] NUTCH-3044 Generator: NPE when extracting the host part of a URL fails [nutch]

2024-04-28 Thread via GitHub
lewismc commented on PR #815: URL: https://github.com/apache/nutch/pull/815#issuecomment-2081564107 Excellent @sebastian-nagel +1

Re: [PR] NUTCH-3043 Generator: count URLs rejected by URL filters [nutch]

2024-04-28 Thread via GitHub
lewismc commented on PR #814: URL: https://github.com/apache/nutch/pull/814#issuecomment-2081563229 Excellent @sebastian-nagel

Re: [PR] NUTCH-3044 Generator: NPE when extracting the host part of a URL fails [nutch]

2024-04-27 Thread via GitHub
sebastian-nagel commented on PR #815: URL: https://github.com/apache/nutch/pull/815#issuecomment-2080743831 ... also fixed the Javadoc error.

Re: [PR] NUTCH-3043 Generator: count URLs rejected by URL filters [nutch]

2024-04-27 Thread via GitHub
sebastian-nagel commented on PR #814: URL: https://github.com/apache/nutch/pull/814#issuecomment-2080634329 Hi @lewismc: - "use parameterized logging": done - "augment the [metrics documentation](https://cwiki.apache.org/confluence/display/NUTCH/Metrics) once this is merged.": will

Re: [PR] NUTCH-3044 Generator: NPE when extracting the host part of a URL fails [nutch]

2024-04-27 Thread via GitHub
sebastian-nagel commented on PR #815: URL: https://github.com/apache/nutch/pull/815#issuecomment-2080603546 > we could provide a TestGenerator#testNullHostInReducer test case Good idea! Done, see 4729786.
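
A minimal sketch of the failure mode named in the NUTCH-3044 subject line, assuming the NPE arises when host extraction yields null for an unparseable URL and that value is then used as a grouping key. The names `HostExtractionSketch`, `hostOrNull`, and `partitionKey`, as well as the `(unknown)` fallback, are hypothetical; none of this is taken from Generator.java or from commit 4729786.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical illustration only; not the code changed in PR #815.
class HostExtractionSketch {

    // Returns the host part of the URL, or null if it cannot be extracted.
    static String hostOrNull(String url) {
        try {
            String host = new URI(url).toURL().getHost();
            return (host == null || host.isEmpty()) ? null : host;
        } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
            // Unparseable, relative, or unknown-scheme URL: signal failure with null.
            return null;
        }
    }

    // Null-safe grouping key: the caller never dereferences a null host.
    static String partitionKey(String url) {
        String host = hostOrNull(url);
        return host == null ? "(unknown)" : host.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(partitionKey("http://Example.org/a"));
        System.out.println(partitionKey("not a url"));
    }
}
```

A test along the lines of the suggested `testNullHostInReducer` would presumably feed an input like the second one and check that no exception escapes.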

Re: [PR] NUTCH-3043 Generator: count URLs rejected by URL filters [nutch]

2024-04-25 Thread via GitHub
lewismc commented on code in PR #814: URL: https://github.com/apache/nutch/pull/814#discussion_r1579883313 ## src/java/org/apache/nutch/crawl/Generator.java: ## @@ -253,10 +256,7 @@ public void map(Text key, CrawlDatum value, Context context) try { sort =

Re: [PR] NUTCH-3041 Address confusing logging in o.a.n.net.URLExemptionFilters [nutch]

2024-04-19 Thread via GitHub
lewismc commented on PR #813: URL: https://github.com/apache/nutch/pull/813#issuecomment-2067543713 The logging now looks as follows ```INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] Found 1 URLExemptionFilter implementations:

[PR] NUTCH-3039 Failure to handle ftp:// URLs [nutch]

2024-04-11 Thread via GitHub
sebastian-nagel opened a new pull request, #812: URL: https://github.com/apache/nutch/pull/812 Pass ftp:// URLs to the standard JVM URLStreamHandler
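
A minimal sketch of what relying on the JVM's own handler for ftp:// URLs can look like, assuming an OpenJDK-based runtime, which ships a built-in handler for the ftp scheme. The class name `FtpUrlSketch` and the example URL are hypothetical. The sketch only parses the URL and opens no connection, and it does not show how PR #812 wires this into Nutch.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical illustration only; not the code changed in PR #812.
class FtpUrlSketch {

    // Parses the URL using whatever handler the JVM has registered for its scheme.
    // If the JVM knows no handler for the scheme, the failure is rethrown unchecked.
    static URL parse(String url) {
        try {
            return URI.create(url).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("No handler for: " + url, e);
        }
    }

    public static void main(String[] args) {
        URL u = parse("ftp://ftp.example.org/pub/readme.txt");
        System.out.println(u.getProtocol() + " " + u.getHost() + " " + u.getPath());
    }
}
```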

Re: [PR] NUTCH-3038 Address issues discovered during 1.20 release management dryrun [nutch]

2024-04-08 Thread via GitHub
lewismc merged PR #811: URL: https://github.com/apache/nutch/pull/811

[PR] NUTCH-3038 Address issues discovered during 1.20 release management dryrun [nutch]

2024-04-05 Thread via GitHub
lewismc opened a new pull request, #811: URL: https://github.com/apache/nutch/pull/811 PR for https://issues.apache.org/jira/browse/NUTCH-3038

Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]

2024-04-04 Thread via GitHub
lewismc merged PR #810: URL: https://github.com/apache/nutch/pull/810

Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]

2024-03-30 Thread via GitHub
CatChullain commented on PR #810: URL: https://github.com/apache/nutch/pull/810#issuecomment-2028497765 Thanks again, @lewismc. I did add those INFO messages, but I found an extra call to setIndexedConf from setConf that the filter() method handles more cleanly, so I removed that,

Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]

2024-03-30 Thread via GitHub
lewismc commented on PR #810: URL: https://github.com/apache/nutch/pull/810#issuecomment-2028343327 Hi @CatChullain I associated this Jira ticket with the 1.20 release and made you assignee. We will get it merged soon and roll the release.

Re: [PR] NUTCH-3035 Update license and notice file for release of 1.20 [nutch]

2024-03-30 Thread via GitHub
lewismc merged PR #808: URL: https://github.com/apache/nutch/pull/808

Re: [PR] NUTCH-3036 Upgrade org.seleniumhq.selenium:selenium-java dependency i… [nutch]

2024-03-30 Thread via GitHub
lewismc merged PR #807: URL: https://github.com/apache/nutch/pull/807

Re: [PR] NUTCH-3037 Upgrade org.apache.kafka:kafka_2.12: to v3.7.0 [nutch]

2024-03-30 Thread via GitHub
lewismc merged PR #809: URL: https://github.com/apache/nutch/pull/809

Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]

2024-03-30 Thread via GitHub
lewismc commented on PR #810: URL: https://github.com/apache/nutch/pull/810#issuecomment-2028304406 @CatChullain thanks for your patience whilst we work this one  > … I wonder where might be good spots for INFO level messages The reason I suggested that the log level be

Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]

2024-03-30 Thread via GitHub
CatChullain commented on PR #810: URL: https://github.com/apache/nutch/pull/810#issuecomment-2028122038 Thanks, Lewis! I moved all four to DEBUG, but I wonder where might be good spots for INFO level messages. I'm thinking of the operator or tech who doesn't dig into code and has an issue

Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]

2024-03-29 Thread via GitHub
lewismc commented on code in PR #810: URL: https://github.com/apache/nutch/pull/810#discussion_r1544806230 ## src/plugin/index-arbitrary/src/java/org/apache/nutch/indexer/arbitrary/ArbitraryIndexingFilter.java: ## @@ -0,0 +1,284 @@ +/* + * Licensed to the Apache Software

Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]

2024-03-26 Thread via GitHub
CatChullain commented on PR #810: URL: https://github.com/apache/nutch/pull/810#issuecomment-2021774505 Thanks, Lewis! I got some of it done today. I'll consolidate the LOG statements a bit more tomorrow.

Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]

2024-03-26 Thread via GitHub
lewismc commented on code in PR #810: URL: https://github.com/apache/nutch/pull/810#discussion_r1539452666 ## src/plugin/index-arbitrary/src/java/org/apache/nutch/indexer/arbitrary/ArbitraryIndexingFilter.java: ## @@ -0,0 +1,266 @@ +package org.apache.nutch.indexer.arbitrary; +

Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]

2024-03-26 Thread via GitHub
lewismc commented on code in PR #810: URL: https://github.com/apache/nutch/pull/810#discussion_r1539390873 ## src/plugin/index-arbitrary/ivy.xml: ## @@ -0,0 +1,41 @@ + + + + Review Comment: Please remove whitespace. ##

[PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]

2024-03-25 Thread via GitHub
CatChullain opened a new pull request, #810: URL: https://github.com/apache/nutch/pull/810 This is the initial code for an arbitrary indexing filter, NUTCH-3032. It could be helpful to let end users manipulate information at indexing time with their own code without the need for

Re: [PR] NUTCH-3035 Update license and notice file for release of 1.20 [nutch]

2024-03-15 Thread via GitHub
sebastian-nagel commented on PR #808: URL: https://github.com/apache/nutch/pull/808#issuecomment-2000233258 Hi Lewis, it's done in three steps: 1. run `ant report-licenses` (Rat task) for core and all plugins 2. process all reports: list all combinations of , try to extract the

Re: [PR] NUTCH-3026 -- add statusOnly as an indexing option [nutch]

2024-03-15 Thread via GitHub
tballison closed pull request #799: NUTCH-3026 -- add statusOnly as an indexing option URL: https://github.com/apache/nutch/pull/799

Re: [PR] WIP StatsD metrics example [nutch]

2024-03-14 Thread via GitHub
lewismc commented on PR #712: URL: https://github.com/apache/nutch/pull/712#issuecomment-1998875276 Closing this PR out. StatsD is widely used but open source Java SDKs/agents are few and far between. When I get around to properly instrumenting Nutch I will probably suggest that we use

Re: [PR] WIP StatsD metrics example [nutch]

2024-03-14 Thread via GitHub
lewismc closed pull request #712: WIP StatsD metrics example URL: https://github.com/apache/nutch/pull/712

Re: [PR] NUTCH-3036 Upgrade org.seleniumhq.selenium:selenium-java dependency i… [nutch]

2024-03-14 Thread via GitHub
lewismc closed pull request #807: NUTCH-3036 Upgrade org.seleniumhq.selenium:selenium-java dependency i… URL: https://github.com/apache/nutch/pull/807

Re: [PR] NUTCH-3036 Upgrade org.seleniumhq.selenium:selenium-java dependency i… [nutch]

2024-03-14 Thread via GitHub
lewismc commented on PR #807: URL: https://github.com/apache/nutch/pull/807#issuecomment-1998718730 There are some tangential proposed changes (such as improvements to logging) to this PR but they concern the relevant Class files.

Re: [PR] NUTCH-3035 Update license and notice file for release of 1.20 [nutch]

2024-03-14 Thread via GitHub
lewismc commented on PR #808: URL: https://github.com/apache/nutch/pull/808#issuecomment-1998717443 Hi @sebastian-nagel did you perform this task manually?

Re: [PR] NUTCH-3035 Update license and notice file for release of 1.20 [nutch]

2024-03-14 Thread via GitHub
lewismc closed pull request #808: NUTCH-3035 Update license and notice file for release of 1.20 URL: https://github.com/apache/nutch/pull/808

Re: [PR] NUTCH-3036 Upgrade org.seleniumhq.selenium:selenium-java dependency i… [nutch]

2024-03-14 Thread via GitHub
lewismc commented on PR #807: URL: https://github.com/apache/nutch/pull/807#issuecomment-1998714969 [Further guidance on browser compatibility/supported platforms](https://firefox-source-docs.mozilla.org/testing/geckodriver/Support.html) Along the way I discovered that **_full

Re: [PR] NUTCH-3036 Upgrade org.seleniumhq.selenium:selenium-java dependency i… [nutch]

2024-03-14 Thread via GitHub
lewismc commented on PR #807: URL: https://github.com/apache/nutch/pull/807#issuecomment-1998711992 PR ready for review. Tested on * MacBook Pro * Apple M1 Pro * Sonoma 14.4 * Firefox 115.X (compatible with current version of Selenium)

[PR] NUTCH-3035 Update license and notice file for release of 1.20 [nutch]

2024-03-14 Thread via GitHub
sebastian-nagel opened a new pull request, #808: URL: https://github.com/apache/nutch/pull/808 Update the license and notice files of dependencies included as binary jar files in the binary release.

Re: [PR] NUTCH-3008 indexer-elastic: downgrade to ES 7.10.2 to address licensing issues [nutch]

2024-03-14 Thread via GitHub
sebastian-nagel merged PR #806: URL: https://github.com/apache/nutch/pull/806

Re: [PR] NUTCH-3008 indexer-elastic: downgrade to ES 7.10.2 to address licensing issues [nutch]

2024-03-13 Thread via GitHub
lewismc commented on PR #806: URL: https://github.com/apache/nutch/pull/806#issuecomment-1995922015 Tested with an ES 7.10.2 6-node cluster. +1 LGTM.

[PR] NUTCH-3008 indexer-elastic: downgrade to ES 7.10.2 to address licensing issues [nutch]

2024-03-13 Thread via GitHub
sebastian-nagel opened a new pull request, #806: URL: https://github.com/apache/nutch/pull/806 This PR downgrades the ES client to version 7.10.2 which is licensed under ASF 2.0 - it's a quick fix to stay compatible with ASF policies. Not yet tested: indexing into ES To be

Re: [PR] NUTCH-3033 Upgrade Ivy to v2.5.2 [nutch]

2024-03-13 Thread via GitHub
lewismc commented on PR #803: URL: https://github.com/apache/nutch/pull/803#issuecomment-1994562354 Thanks @sebastian-nagel

Re: [PR] NUTCH-3033 Upgrade Ivy to v2.5.2 [nutch]

2024-03-13 Thread via GitHub
lewismc merged PR #803: URL: https://github.com/apache/nutch/pull/803

Re: [PR] Update Dockerfile / JAVA_HOME - 2nd try [nutch]

2024-03-12 Thread via GitHub
lewismc commented on PR #805: URL: https://github.com/apache/nutch/pull/805#issuecomment-1993567784 Thanks @derhecht

Re: [PR] Update Dockerfile / JAVA_HOME - 2nd try [nutch]

2024-03-12 Thread via GitHub
lewismc merged PR #805: URL: https://github.com/apache/nutch/pull/805

Re: [PR] NUTCH-3033 Upgrade Ivy to v2.5.2 [nutch]

2024-03-12 Thread via GitHub
lewismc commented on PR #803: URL: https://github.com/apache/nutch/pull/803#issuecomment-1993545146 After lots of trial and error I think I cracked this one. Ultimately there were several places where the optional `(-[classifier])` element has to be added to the `ivy:retrieve pattern`.

Re: [PR] NUTCH-3033 Upgrade Ivy to v2.5.2 [nutch]

2024-03-12 Thread via GitHub
lewismc closed pull request #803: NUTCH-3033 Upgrade Ivy to v2.5.2 URL: https://github.com/apache/nutch/pull/803

Re: [PR] Update Dockerfile / JAVA_HOME [nutch]

2024-03-12 Thread via GitHub
derhecht commented on PR #801: URL: https://github.com/apache/nutch/pull/801#issuecomment-1991968018 see #805

[PR] Update Dockerfile / JAVA_HOME - 2nd try [nutch]

2024-03-12 Thread via GitHub
derhecht opened a new pull request, #805: URL: https://github.com/apache/nutch/pull/805 Alpine uses the ash shell by default, which results in an unset JAVA_HOME environment variable. Sorry, there is no issue reported at the moment on issues.apache.org; nevertheless, it is one I'm facing to

Re: [PR] NUTCH-3033 Upgrade Ivy to v2.5.2 [nutch]

2024-03-11 Thread via GitHub
lewismc commented on PR #803: URL: https://github.com/apache/nutch/pull/803#issuecomment-1989446036 Hmmm, I upgraded to 2.5.1 and the CI runs just fine. Looks like there is some regression/additional configuration required with 2.5.2. I’m asking the question over on ivy-user@ mailing list.

Re: [PR] NUTCH-3026 -- add statusOnly as an indexing option [nutch]

2024-03-11 Thread via GitHub
lewismc commented on PR #799: URL: https://github.com/apache/nutch/pull/799#issuecomment-1989404991 Hmmm. It appears that there are problems with the `protocol-http` unit tests… ``` [echo] Testing plugin: protocol-http [junit] Running

Re: [PR] NUTCH-3026 -- add statusOnly as an indexing option [nutch]

2024-03-11 Thread via GitHub
lewismc closed pull request #799: NUTCH-3026 -- add statusOnly as an indexing option URL: https://github.com/apache/nutch/pull/799

Re: [PR] Update Dockerfile / JAVA_HOME [nutch]

2024-03-11 Thread via GitHub
lewismc commented on PR #801: URL: https://github.com/apache/nutch/pull/801#issuecomment-1989379558 @derhecht apologies, I merged this mistakenly. Can you please submit the PR against the master branch? Thank you

Re: [PR] NUTCH-3026 -- add statusOnly as an indexing option [nutch]

2024-03-11 Thread via GitHub
lewismc commented on PR #799: URL: https://github.com/apache/nutch/pull/799#issuecomment-1989380993 Reopening to have CI run again.

Re: [PR] Revert "Update Dockerfile / JAVA_HOME" [nutch]

2024-03-11 Thread via GitHub
lewismc merged PR #804: URL: https://github.com/apache/nutch/pull/804

[PR] Revert "Update Dockerfile / JAVA_HOME" [nutch]

2024-03-11 Thread via GitHub
lewismc opened a new pull request, #804: URL: https://github.com/apache/nutch/pull/804 Reverts apache/nutch#801

Re: [PR] Update Dockerfile / JAVA_HOME [nutch]

2024-03-11 Thread via GitHub
lewismc merged PR #801: URL: https://github.com/apache/nutch/pull/801

Re: [PR] NUTCH-3033 Upgrade Ivy to v2.5.2 [nutch]

2024-03-11 Thread via GitHub
lewismc commented on PR #803: URL: https://github.com/apache/nutch/pull/803#issuecomment-1989365064 OK so it looks like the [newer Ivy version is being used just fine](https://github.com/apache/nutch/actions/runs/8239165168/job/22531780061?pr=803#step:4:78). The build did however fail with

[PR] NUTCH-3033 Upgrade Ivy to v2.5.2 [nutch]

2024-03-11 Thread via GitHub
lewismc opened a new pull request, #803: URL: https://github.com/apache/nutch/pull/803 PR for https://issues.apache.org/jira/browse/NUTCH-3033 I was having trouble locally resolving the Ivy version to 2.5.2… I can’t yet figure out why 2.5.1 was being used. I’ll check out the CI log and

Re: [PR] [NUTCH-2834] Update crawl documentation / Fix #557 [nutch]

2024-03-10 Thread via GitHub
sebastian-nagel commented on PR #800: URL: https://github.com/apache/nutch/pull/800#issuecomment-1987171023 Thanks, @derhecht! Good catch!

Re: [PR] [NUTCH-2834] Update crawl documentation / Fix #557 [nutch]

2024-03-10 Thread via GitHub
sebastian-nagel merged PR #800: URL: https://github.com/apache/nutch/pull/800

Re: [PR] fix for NUTCH-3027 contributed by skehrli [nutch]

2024-03-10 Thread via GitHub
sebastian-nagel closed pull request #802: fix for NUTCH-3027 contributed by skehrli URL: https://github.com/apache/nutch/pull/802

Re: [PR] fix for NUTCH-3027 contributed by skehrli [nutch]

2024-03-10 Thread via GitHub
sebastian-nagel commented on PR #802: URL: https://github.com/apache/nutch/pull/802#issuecomment-1987165751 Patch applied to master in d95e1a7, see comments on Jira in NUTCH-3027. Thanks again @skehrli !

[PR] fix for NUTCH-3027 contributed by skehrli [nutch]

2024-01-18 Thread via GitHub
skehrli opened a new pull request, #802: URL: https://github.com/apache/nutch/pull/802 Thanks for your contribution to [Apache Nutch](https://nutch.apache.org/)! Your help is appreciated! Before opening the pull request, please verify that * there is an open issue on the [Nutch

Re: [PR] NUTCH-1541 Indexer plugin to write CSV [nutch]

2024-01-04 Thread via GitHub
lewismc commented on PR #294: URL: https://github.com/apache/nutch/pull/294#issuecomment-1877232892 Hi @grege117 I’ll try to have a crack at this _soon_. Thanks for the heads up. If you feel like forking the branch and having a go at the fix, then please do. I will try to shepherd in your

Re: [PR] NUTCH-1541 Indexer plugin to write CSV [nutch]

2023-12-22 Thread via GitHub
grege117 commented on PR #294: URL: https://github.com/apache/nutch/pull/294#issuecomment-1868000580 Sorry to chime in a few years late, but I'm not sure this plugin is configured correctly. If I modify my conf/index-writers.xml and remove everything except for ", you will get the

[PR] [NUTCH-2834] Update crawl documentation / Fix #557 [nutch]

2023-12-14 Thread via GitHub
derhecht opened a new pull request, #800: URL: https://github.com/apache/nutch/pull/800 Show --dedup-group instead of -dedup-group, which has led to misleading output

Re: [PR] NUTCH-3024 Remove flaky 'dependency check' target [nutch]

2023-11-24 Thread via GitHub
lewismc merged PR #795: URL: https://github.com/apache/nutch/pull/795

[PR] NUTCH-3026 -- add statusOnly as an indexing option [nutch]

2023-11-17 Thread via GitHub
tballison opened a new pull request, #799: URL: https://github.com/apache/nutch/pull/799

[PR] fix for NUTCH-2812 contributed by GabeHaegele [nutch]

2023-11-08 Thread via GitHub
GabeHaegele opened a new pull request, #798: URL: https://github.com/apache/nutch/pull/798

Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]

2023-11-08 Thread via GitHub
sebastian-nagel commented on PR #796: URL: https://github.com/apache/nutch/pull/796#issuecomment-1802531264 Thanks, @jnioche!

Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]

2023-11-08 Thread via GitHub
sebastian-nagel merged PR #796: URL: https://github.com/apache/nutch/pull/796

Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]

2023-11-08 Thread via GitHub
jnioche commented on PR #796: URL: https://github.com/apache/nutch/pull/796#issuecomment-1801938355 @sebastian-nagel merged the changes from master and made a few improvements

Re: [PR] [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 [nutch]

2023-11-08 Thread via GitHub
sebastian-nagel commented on PR #793: URL: https://github.com/apache/nutch/pull/793#issuecomment-1801814549 Thanks, @jnioche! Merged into master, adding the lines to make use of Hadoop-provided compression codecs. Successfully tested in local and pseudo-distributed mode with

Re: [PR] [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 [nutch]

2023-11-08 Thread via GitHub
sebastian-nagel closed pull request #793: [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 URL: https://github.com/apache/nutch/pull/793

Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]

2023-11-07 Thread via GitHub
jnioche commented on PR #796: URL: https://github.com/apache/nutch/pull/796#issuecomment-1798221743 Writing a test for this thing is an absolute pain. The way the filters are used for real is that their method setConf is called and the rules are loaded using _getConfResourceAsReader_, i.e.

Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]

2023-11-07 Thread via GitHub
jnioche commented on code in PR #796: URL: https://github.com/apache/nutch/pull/796#discussion_r1384621727 ## src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java: ## @@ -97,9 +97,17 @@ public class FastURLFilter implements URLFilter {

Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]

2023-11-07 Thread via GitHub
sebastian-nagel commented on code in PR #796: URL: https://github.com/apache/nutch/pull/796#discussion_r1384536930 ## src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java: ## @@ -97,9 +97,17 @@ public class FastURLFilter implements URLFilter {

Re: [PR] NUTCH-3020 -- ParseSegment should check for okhttp's truncation flag [nutch]

2023-11-06 Thread via GitHub
tballison merged PR #794: URL: https://github.com/apache/nutch/pull/794

Re: [PR] NUTCH-3019 -- update Tika to 2.9.1 [nutch]

2023-11-06 Thread via GitHub
tballison merged PR #797: URL: https://github.com/apache/nutch/pull/797

Re: [PR] NUTCH-3019 -- update Tika to 2.9.1 [nutch]

2023-11-06 Thread via GitHub
tballison commented on PR #797: URL: https://github.com/apache/nutch/pull/797#issuecomment-1795161171 ```2023-11-06T15:02:47.9408964Z [junit] Tests run: 14, Failures: 2, Errors: 0, Skipped: 4, Time elapsed: 4.342 sec 2023-11-06T15:02:48.2192793Z [junit] Test

Re: [PR] NUTCH-3019 -- update Tika to 2.9.1 [nutch]

2023-11-06 Thread via GitHub
tballison commented on PR #797: URL: https://github.com/apache/nutch/pull/797#issuecomment-1794934171 Need to keep as draft until the 2.9.1.0 shim actually lands in maven central.

[PR] NUTCH-3019 -- update Tika to 2.9.1 [nutch]

2023-11-06 Thread via GitHub
tballison opened a new pull request, #797: URL: https://github.com/apache/nutch/pull/797 Thanks for your contribution to [Apache Nutch](https://nutch.apache.org/)! Your help is appreciated! Before opening the pull request, please verify that * there is an open issue on the [Nutch

[PR] NUTCH-3024 Remove flaky 'dependency check' target [nutch]

2023-11-03 Thread via GitHub
lewismc opened a new pull request, #795: URL: https://github.com/apache/nutch/pull/795 Addresses https://issues.apache.org/jira/browse/NUTCH-3024

Re: [PR] NUTCH-3014 Standardize Job names [nutch]

2023-11-02 Thread via GitHub
lewismc merged PR #789: URL: https://github.com/apache/nutch/pull/789

Re: [PR] NUTCH-3014 Standardize Job names [nutch]

2023-11-02 Thread via GitHub
lewismc commented on code in PR #789: URL: https://github.com/apache/nutch/pull/789#discussion_r138646 ## src/java/org/apache/nutch/crawl/CrawlDbReader.java: ## @@ -812,7 +811,7 @@ public CrawlDatum get(String crawlDb, String url, Configuration config) @Override

Re: [PR] NUTCH-3020 -- ParseSegment should check for okhttp's truncation flag [nutch]

2023-11-01 Thread via GitHub
lewismc commented on PR #794: URL: https://github.com/apache/nutch/pull/794#issuecomment-1789810071 We have no tests for `ParseSegment` right now. I think it would be excellent if this PR could include a test for `ParseSegment.isTruncated`.

[PR] NUTCH-3020 -- ParseSegment should check for okhttp's truncation flag [nutch]

2023-11-01 Thread via GitHub
tballison opened a new pull request, #794: URL: https://github.com/apache/nutch/pull/794 Thanks for your contribution to [Apache Nutch](https://nutch.apache.org/)! Your help is appreciated! Before opening the pull request, please verify that * there is an open issue on the [Nutch

Re: [PR] [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 [nutch]

2023-10-31 Thread via GitHub
sebastian-nagel commented on code in PR #793: URL: https://github.com/apache/nutch/pull/793#discussion_r1377375552 ## src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java: ## @@ -181,9 +186,23 @@ public String filter(String url) { public

Re: [PR] Allow fast-urlfilter to load from HDFS/S3 and support gzipped input [NUTCH-3017] [nutch]

2023-10-30 Thread via GitHub
jnioche commented on PR #792: URL: https://github.com/apache/nutch/pull/792#issuecomment-1785804884 Obviously, pulled more changes than I meant to

Re: [PR] Allow fast-urlfilter to load from HDFS/S3 and support gzipped input [NUTCH-3017] [nutch]

2023-10-30 Thread via GitHub
jnioche closed pull request #792: Allow fast-urlfilter to load from HDFS/S3 and support gzipped input [NUTCH-3017] URL: https://github.com/apache/nutch/pull/792

[PR] Allow fast-urlfilter to load from HDFS/S3 and support gzipped input [NUTCH-3017] [nutch]

2023-10-30 Thread via GitHub
jnioche opened a new pull request, #792: URL: https://github.com/apache/nutch/pull/792 See description in https://issues.apache.org/jira/browse/NUTCH-3017

Re: [PR] NUTCH-3014 Standardize Job names [nutch]

2023-10-29 Thread via GitHub
sebastian-nagel commented on code in PR #789: URL: https://github.com/apache/nutch/pull/789#discussion_r1375421979 ## src/java/org/apache/nutch/crawl/CrawlDbReader.java: ## @@ -812,7 +811,7 @@ public CrawlDatum get(String crawlDb, String url, Configuration config)
