[PR] NUTCH-3058 Fetcher: counter for hung threads [nutch]
sebastian-nagel opened a new pull request, #820: URL: https://github.com/apache/nutch/pull/820 - count the number of hung threads in a fetcher job - log and count the number of fetch items still queued when the "hard" timeout is reached -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] NUTCH-3055 README: fix Github "hub" commands [nutch]
sebastian-nagel merged PR #818: URL: https://github.com/apache/nutch/pull/818
Re: [PR] NUTCH-3044 Generator: NPE when extracting the host part of a URL fails [nutch]
sebastian-nagel merged PR #815: URL: https://github.com/apache/nutch/pull/815
Re: [PR] NUTCH-3057 - Fix for index-arbitrary plugin improper retention and us… [nutch]
lewismc commented on PR #819: URL: https://github.com/apache/nutch/pull/819#issuecomment-2118551238 Thanks for reporting, @CatChullain. I didn't catch this edge case either when reviewing or testing. Out of curiosity, what does your deployment look like? Local or deploy?
[PR] NUTCH-3057 - Fix for index-arbitrary plugin improper retention and us… [nutch]
CatChullain opened a new pull request, #819: URL: https://github.com/apache/nutch/pull/819 Fix for NUTCH-3057, where the index-arbitrary plugin retained the value of one field and erroneously set it on the next field declared in its config stanzas.
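The PR description does not show the code, so the following is only a generic illustration of this class of bug, not the plugin's actual fix: a value held outside the per-field loop can leak into the next field when that field produces nothing. All names here (`StaleStateSketch`, `buggy`, `fixed`) are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of state retained across loop iterations. */
class StaleStateSketch {

  /** Buggy variant: 'result' outlives the iteration that computed it. */
  static Map<String, Object> buggy(Map<String, Object> computed) {
    Map<String, Object> doc = new LinkedHashMap<>();
    Object result = null; // shared by all fields
    for (Map.Entry<String, Object> e : computed.entrySet()) {
      if (e.getValue() != null) {
        result = e.getValue();
      }
      if (result != null) {
        doc.put(e.getKey(), result); // may reuse the previous field's value
      }
    }
    return doc;
  }

  /** Fixed variant: 'result' is scoped to a single field. */
  static Map<String, Object> fixed(Map<String, Object> computed) {
    Map<String, Object> doc = new LinkedHashMap<>();
    for (Map.Entry<String, Object> e : computed.entrySet()) {
      Object result = e.getValue(); // fresh for every field
      if (result != null) {
        doc.put(e.getKey(), result);
      }
    }
    return doc;
  }

  public static void main(String[] args) {
    Map<String, Object> computed = new LinkedHashMap<>();
    computed.put("advisors", 2);
    computed.put("captain", null); // second field yields no value
    System.out.println("buggy: " + buggy(computed));
    System.out.println("fixed: " + fixed(computed));
  }
}
```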
Re: [PR] NUTCH-3041 Address confusing logging in o.a.n.net.URLExemptionFilters [nutch]
lewismc merged PR #813: URL: https://github.com/apache/nutch/pull/813
Re: [PR] Revert incorrect change [nutch-site]
lewismc commented on PR #2: URL: https://github.com/apache/nutch-site/pull/2#issuecomment-2112989006 Yes thank you @sebbASF
Re: [PR] NUTCH-3043 Generator: count URLs rejected by URL filters [nutch]
sebastian-nagel commented on PR #814: URL: https://github.com/apache/nutch/pull/814#issuecomment-2110558876 Thanks, @lewismc! The metrics wiki page was updated.
Re: [PR] NUTCH-3043 Generator: count URLs rejected by URL filters [nutch]
sebastian-nagel merged PR #814: URL: https://github.com/apache/nutch/pull/814
Re: [PR] NUTCH-3039 Failure to handle ftp:// URLs [nutch]
sebastian-nagel merged PR #812: URL: https://github.com/apache/nutch/pull/812
Re: [PR] Revert incorrect change [nutch-site]
sebastian-nagel commented on PR #2: URL: https://github.com/apache/nutch-site/pull/2#issuecomment-2105982524 Thanks, @sebbASF!
Re: [PR] Revert incorrect change [nutch-site]
sebastian-nagel merged PR #2: URL: https://github.com/apache/nutch-site/pull/2
[PR] Revert incorrect change [nutch-site]
sebbASF opened a new pull request, #2: URL: https://github.com/apache/nutch-site/pull/2 Nutch is currently not listed under the web-framework category on projects.apache.org
Re: [PR] NUTCH-3054 Address deprecation of Node16 for all GitHub Actions [nutch]
lewismc merged PR #817: URL: https://github.com/apache/nutch/pull/817
[PR] NUTCH-1806 Delegate processing of URL domains to crawler-common [nutch]
sebastian-nagel opened a new pull request, #816: URL: https://github.com/apache/nutch/pull/816 and NUTCH-1942 Remove TopLevelDomain - use methods from crawler-commons' EffectiveTldFinder in URLUtil, replacing classes and methods from the "org.apache.nutch.util.domain" package - adapt and extend unit tests - add tests for URLUtil.getTopLevelDomainName(url) - reflect changes to the public suffix list since 2014 ("xyz" is now a public suffix / ICANN suffix) - adapt to minor API changes - URLUtil.getDomainName(url) returns the host name in case no valid public suffix is found - for Unicode suffixes and TLDs the methods URLUtil.getDomainSuffix(url) resp. URLUtil.getTopLevelDomainName(url) now return the ASCII representation - add unit tests for host names with trailing dot ("www.apache.org.") - add unit test for URLs without host/domain (cf. NUTCH-2450) - update and complete Javadoc - update DomainStatistics, TLDIndexingFilter and domain URL filters to use the updated methods in URLUtil - remove the class TLDScoringFilter. The configuration is bound to the domain-suffixes.xml which wasn't maintained anymore and is now removed - remove package org.apache.nutch.util.domain - move DomainStatistics to org.apache.nutch.util - remove configuration files of domain utils
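The edge cases this PR lists (Unicode suffixes returned in ASCII form, host names with a trailing dot, URLs without a host) can be reproduced with the JDK alone. The sketch below uses `java.net.IDN` and `java.net.URI` as stand-ins; it does not call Nutch's `URLUtil` or crawler-commons, and the class name `HostNameEdgeCases` is made up for illustration.

```java
import java.net.IDN;
import java.net.URI;

/** JDK-only stand-in showing the inputs the PR's new unit tests cover. */
class HostNameEdgeCases {

  /** Returns the host part of a URL, or null when the URL has none. */
  static String host(String url) {
    return URI.create(url).getHost();
  }

  /** Returns the ASCII (Punycode) form of a possibly Unicode name. */
  static String ascii(String name) {
    return IDN.toASCII(name);
  }

  public static void main(String[] args) {
    // Unicode TLD (Cyrillic "rf") and a Unicode domain label, written
    // with escapes so the source file stays ASCII.
    System.out.println(ascii("\u0440\u0444"));
    System.out.println(ascii("m\u00fcnchen.de"));
    // The trailing dot is part of the host string as parsed.
    System.out.println(host("https://www.apache.org./"));
    // No authority component, hence no host.
    System.out.println(host("file:/tmp/index.html"));
  }
}
```

Code that derives a domain name from a host has to decide what to do with each of these; the PR text only states that such cases are now covered by tests and that ASCII representations are returned.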
Re: [PR] NUTCH-3044 Generator: NPE when extracting the host part of a URL fails [nutch]
lewismc commented on PR #815: URL: https://github.com/apache/nutch/pull/815#issuecomment-2081564107 Excellent @sebastian-nagel +1
Re: [PR] NUTCH-3043 Generator: count URLs rejected by URL filters [nutch]
lewismc commented on PR #814: URL: https://github.com/apache/nutch/pull/814#issuecomment-2081563229 Excellent @sebastian-nagel
Re: [PR] NUTCH-3044 Generator: NPE when extracting the host part of a URL fails [nutch]
sebastian-nagel commented on PR #815: URL: https://github.com/apache/nutch/pull/815#issuecomment-2080743831 ... also fixed the Javadoc error.
Re: [PR] NUTCH-3043 Generator: count URLs rejected by URL filters [nutch]
sebastian-nagel commented on PR #814: URL: https://github.com/apache/nutch/pull/814#issuecomment-2080634329 Hi @lewismc: - "use parameterized logging": done - "augment the [metrics documentation](https://cwiki.apache.org/confluence/display/NUTCH/Metrics) once this is merged.": will do - "we could also [create a test for the counters](https://cwiki.apache.org/confluence/display/MRUNIT/MRUnit+Tutorial#MRUnitTutorial-TestingCounters).": for now, TestGenerator is not based on MRUNIT. The various Generator::generate(...) methods return the number of generated segments without a way to access the counters (they're logged, however). I'd prefer to track this in a separate issue, because it would require too many code changes to read the counters.
Re: [PR] NUTCH-3044 Generator: NPE when extracting the host part of a URL fails [nutch]
sebastian-nagel commented on PR #815: URL: https://github.com/apache/nutch/pull/815#issuecomment-2080603546 > we could provide a TestGenerator#testNullHostInReducer test case Good idea! Done, see 4729786.
Re: [PR] NUTCH-3043 Generator: count URLs rejected by URL filters [nutch]
lewismc commented on code in PR #814: URL: https://github.com/apache/nutch/pull/814#discussion_r1579883313 ## src/java/org/apache/nutch/crawl/Generator.java: ## @@ -253,10 +256,7 @@ public void map(Text key, CrawlDatum value, Context context) try { sort = scfilters.generatorSortValue(key, crawlDatum, sort); } catch (ScoringFilterException sfe) { -if (LOG.isWarnEnabled()) { - LOG.warn( - "Couldn't filter generatorSortValue for " + key + ": " + sfe); -} +LOG.warn("Couldn't filter generatorSortValue for " + key + ": " + sfe); Review Comment: Please use parameterized logging. ``` LOG.warn("Couldn't filter generatorSortValue for {}: {}", key, sfe); ```
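To show why the review asks for the parameterized form, here is a JDK-only sketch. It does not use SLF4J: the `warn` method below is a hypothetical stand-in for a `{}`-style logger, written only to make the deferred formatting observable, and `CountingKey` is an invented helper that counts `toString()` calls.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Stand-in for illustration; real code would call the SLF4J logger. */
class ParameterizedLoggingSketch {

  /** Counts toString() calls so eager message building becomes visible. */
  static final class CountingKey {
    final AtomicInteger toStringCalls = new AtomicInteger();

    @Override
    public String toString() {
      toStringCalls.incrementAndGet();
      return "http://example.org/";
    }
  }

  /** Hypothetical logger: fills "{}" placeholders only when enabled. */
  static String warn(boolean warnEnabled, String template, Object... args) {
    if (!warnEnabled) {
      return null; // arguments are never converted to strings
    }
    StringBuilder sb = new StringBuilder();
    int from = 0;
    for (Object arg : args) {
      int at = template.indexOf("{}", from);
      if (at < 0) {
        break; // more arguments than placeholders
      }
      sb.append(template, from, at).append(arg);
      from = at + 2;
    }
    return sb.append(template.substring(from)).toString();
  }

  public static void main(String[] args) {
    CountingKey key = new CountingKey();
    // Concatenation converts 'key' before any level check can happen.
    String eager = "Couldn't filter generatorSortValue for " + key + ": sfe";
    // Parameterized call with WARN disabled: 'key' is left untouched.
    warn(false, "Couldn't filter generatorSortValue for {}: {}", key, "sfe");
    System.out.println("toString() calls so far: " + key.toStringCalls.get());
    System.out.println(eager);
  }
}
```

The same reasoning explains why the explicit `LOG.isWarnEnabled()` guard removed in the diff becomes unnecessary once the message is parameterized.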
Re: [PR] NUTCH-3041 Address confusing logging in o.a.n.net.URLExemptionFilters [nutch]
lewismc commented on PR #813: URL: https://github.com/apache/nutch/pull/813#issuecomment-2067543713 The logging now looks as follows ```INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] Found 1 URLExemptionFilter implementations: '[org.apache.nutch.urlfilter.ignoreexempt.ExemptionUrlFilter@3090c372]'```. If no URLExemptionFilter implementations are found then no log statement is produced.
[PR] NUTCH-3039 Failure to handle ftp:// URLs [nutch]
sebastian-nagel opened a new pull request, #812: URL: https://github.com/apache/nutch/pull/812 Pass ftp:// URLs to the standard JVM URLStreamHandler
Re: [PR] NUTCH-3038 Address issues discovered during 1.20 release management dryrun [nutch]
lewismc merged PR #811: URL: https://github.com/apache/nutch/pull/811
[PR] NUTCH-3038 Address issues discovered during 1.20 release management dryrun [nutch]
lewismc opened a new pull request, #811: URL: https://github.com/apache/nutch/pull/811 PR for https://issues.apache.org/jira/browse/NUTCH-3038
Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]
lewismc merged PR #810: URL: https://github.com/apache/nutch/pull/810
Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]
CatChullain commented on PR #810: URL: https://github.com/apache/nutch/pull/810#issuecomment-2028497765 Thanks again, @lewismc. I did add those INFO messages, but I found an extra call to setIndexedConf from setConf that the filter() method handles more cleanly, so I removed that, too.
Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]
lewismc commented on PR #810: URL: https://github.com/apache/nutch/pull/810#issuecomment-2028343327 Hi @CatChullain, I associated this Jira ticket with the 1.20 release and made you assignee. We will get it merged soon and roll the release.
Re: [PR] NUTCH-3035 Update license and notice file for release of 1.20 [nutch]
lewismc merged PR #808: URL: https://github.com/apache/nutch/pull/808
Re: [PR] NUTCH-3036 Upgrade org.seleniumhq.selenium:selenium-java dependency i… [nutch]
lewismc merged PR #807: URL: https://github.com/apache/nutch/pull/807
Re: [PR] NUTCH-3037 Upgrade org.apache.kafka:kafka_2.12: to v3.7.0 [nutch]
lewismc merged PR #809: URL: https://github.com/apache/nutch/pull/809
Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]
lewismc commented on PR #810: URL: https://github.com/apache/nutch/pull/810#issuecomment-2028304406 @CatChullain thanks for your patience whilst we work on this one. > … I wonder where might be good spots for INFO level messages The reason I suggested that the log level be revised from `INFO` to `DEBUG` was that any logging needs to make sense in the context of the entire log. Said another way, plugin logging needs to complement the core crawler tasks. That being said, if you want to include `INFO` for the following scenarios then please go ahead. * recording the count value, and * indicating when overwrite is true Your rationale is sound.
Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]
CatChullain commented on PR #810: URL: https://github.com/apache/nutch/pull/810#issuecomment-2028122038 Thanks, Lewis! I moved all four to DEBUG, but I wonder where might be good spots for INFO level messages. I'm thinking of the operator or tech who doesn't dig into code and has an issue in the config. During dev & test myself, I sometimes forgot to increment the index.arbitrary.function.count and the plugin ignored the later fields. Just outputting that count value, and maybe something when overwrite is true, might be helpful for alerting someone that the config might not be what they'd believed. Do either of those (or something else) seem worthwhile, or does it make more sense to let people use it and see what issues they raise?
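The pitfall described here, where a stale `index.arbitrary.function.count` silently hides later field definitions, follows from reading indexed properties in a counted loop. A minimal sketch, assuming `java.util.Properties` as a stand-in for the Hadoop `Configuration` and a made-up class name `IndexedConfigSketch`; it is not the plugin's code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Stand-in for reading count-indexed properties; not the plugin's code. */
class IndexedConfigSketch {

  /** Collects fieldName.0 .. fieldName.(count-1); higher indexes are ignored. */
  static List<String> fieldNames(Properties conf) {
    int count = Integer.parseInt(
        conf.getProperty("index.arbitrary.function.count", "0"));
    List<String> names = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      String name = conf.getProperty("index.arbitrary.fieldName." + i);
      if (name != null) {
        names.add(name);
      }
    }
    return names;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty("index.arbitrary.function.count", "1"); // not incremented
    conf.setProperty("index.arbitrary.fieldName.0", "advisors");
    conf.setProperty("index.arbitrary.fieldName.1", "captain"); // never read
    System.out.println(fieldNames(conf));
  }
}
```

Logging the count at INFO level, as proposed in the comment, would make this mismatch visible to an operator without reading the code.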
Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]
lewismc commented on code in PR #810: URL: https://github.com/apache/nutch/pull/810#discussion_r1544806230 ## src/plugin/index-arbitrary/src/java/org/apache/nutch/indexer/arbitrary/ArbitraryIndexingFilter.java: ## @@ -0,0 +1,284 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.nutch.indexer.arbitrary; + +import org.apache.nutch.parse.Parse; +import org.apache.nutch.indexer.IndexingFilter; +import org.apache.nutch.indexer.IndexingException; +import org.apache.nutch.indexer.NutchDocument; +import org.apache.nutch.crawl.CrawlDatum; +import org.apache.nutch.crawl.Inlinks; + +import org.apache.hadoop.io.Text; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.lang.invoke.MethodHandles; +import java.lang.Class; +import java.lang.reflect.Constructor; +import java.lang.reflect.Method; + +import org.apache.hadoop.conf.Configuration; + +/** + * Adds arbitrary searchable fields to a document from the class and method + * the user identifies in the config. The user supplies the name of the field + * to add with the class and method names that supply the value. 
+ * + * Example: + * <property> + * <name>index.arbitrary.function.count</name> + * <value>1</value> + * </property> + * + * <property> + * <name>index.arbitrary.fieldName.0</name> + * <value>advisors</value> + * </property> + * + * <property> + * <name>index.arbitrary.className.0</name> + * <value>com.example.arbitrary.AdvisorCalculator</value> + * </property> + * + * <property> + * <name>index.arbitrary.constructorArgs.0</name> + * <value>Kirk</value> + * </property> + * + * <property> + * <name>index.arbitrary.methodName.0</name> + * <value>countAdvisors</value> + * </property> + * + * <property> + * <name>index.arbitrary.methodArgs.0</name> + * <value>Spock,McCoy</value> + * </property> + * + * To set more than one arbitrary field value, + * increment {@code index.arbitrary.function.count} and + * repeat the rest of these blocks with successive int values + * appended to the property names, e.g. fieldName.1, methodName.1, etc. + */ +public class ArbitraryIndexingFilter implements IndexingFilter { + + private static final Logger LOG = LoggerFactory +.getLogger(MethodHandles.lookup().lookupClass()); + + /** How many arbitrary field definitions to set. */ + private int arbitraryAddsCount = 0; + + /** The name of the field to insert/overwrite in the NutchDocument */ + private String fieldName; + + /** The fully-qualified class name of the custom class to use for the + * new field. This class must be in the Nutch runtime classpath, + * e.g., nutch/lib/ dierctory. */ + private String className; + + /** The String values to pass to the custom class constructor. The plugin + * will add the document url as the first argument in className's + * String[] args. */ + private String[] userConstrArgs; + + /** The array where the plugin copies the url the userConstrArgs + * to create the instance of className. */ + private String[] constrArgs; + + /** The name of the method in the custom class to call. Its return value + * will become the value of fieldName in the NutchDocument. 
*/ + private String methodName; + + /** The String values of the arguments to methodName. It's up to the + * developer of className to do any casts/conversions from String to + * another class in the code of className. */ + private String[] methodArgs; + + /** The result that returns from methodName. The plugin will set the value + * of fieldName to this. */ + private Object result; + + /** Optional flag to determine whether to overwrite the existing value in the + * NutchDocument fieldName if this is set to true. Default behavior is to + * add the value from calling methodName to existing values for fieldName. */ + private boolean overwrite = false; + + /** Hadoop Configuration object to pass these values into the plugin. */ + private Configuration conf; + + /** + * The {@link ArbitraryIndexingFilter} filter object uses reflection + * to instantiate the configured class and invoke the configured method. + * It requires a few configuration settings for adding arbitrary fields + * and values to the NutchDocument as searchable fields.
Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]
CatChullain commented on PR #810: URL: https://github.com/apache/nutch/pull/810#issuecomment-2021774505 Thanks, Lewis! I got some of it done today. I'll consolidate the LOG statements a bit more tomorrow.
Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]
lewismc commented on code in PR #810: URL: https://github.com/apache/nutch/pull/810#discussion_r1539452666 ## src/plugin/index-arbitrary/src/java/org/apache/nutch/indexer/arbitrary/ArbitraryIndexingFilter.java: ## @@ -0,0 +1,266 @@ +package org.apache.nutch.indexer.arbitrary; + +import org.apache.nutch.parse.Parse; +import org.apache.nutch.indexer.IndexingFilter; +import org.apache.nutch.indexer.IndexingException; +import org.apache.nutch.indexer.NutchDocument; +import org.apache.nutch.crawl.CrawlDatum; +import org.apache.nutch.crawl.Inlinks; + +import org.apache.hadoop.io.Text; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.lang.invoke.MethodHandles; +import java.lang.Class; +import java.lang.reflect.Constructor; +import java.lang.reflect.Method; + +import org.apache.hadoop.conf.Configuration; + +/** + * Adds arbitrary searchable fields to a document from the class and method + * the user identifies in the config. The user supplies the name of the field + * to add with the class and method names that supply the value. + * + * Example: + * <property> + * <name>index.arbitrary.function.count</name> + * <value>1</value> + * </property> + * + * <property> + * <name>index.arbitrary.fieldName.0</name> + * <value>advisors</value> + * </property> + * + * <property> + * <name>index.arbitrary.className.0</name> + * <value>com.example.arbitrary.AdvisorCalculator</value> + * </property> + * + * <property> + * <name>index.arbitrary.constructorArgs.0</name> + * <value>Kirk</value> + * </property> + * + * <property> + * <name>index.arbitrary.methodName.0</name> + * <value>countAdvisors</value> + * </property> + * + * <property> + * <name>index.arbitrary.methodArgs.0</name> + * <value>Spock,McCoy</value> + * </property> + * + * To set more than one arbitrary field value, + * increment {@code index.arbitrary.function.count} and + * repeat the rest of these blocks with successive int values + * appended to the property names, e.g. fieldName.1, methodName.1, etc. 
+ */ +public class ArbitraryIndexingFilter implements IndexingFilter { + + private static final Logger LOG = LoggerFactory +.getLogger(MethodHandles.lookup().lookupClass()); + + /** How many arbitrary field definitions to set. */ + private int arbitraryAddsCount = 0; + + /** The name of the field to insert/overwrite in the NutchDocument */ + private String fieldName; + + /** The fully-qualified class name of the custom class to use for the + * new field. This class must be in the Nutch runtime classpath, + * e.g., nutch/lib/ dierctory. */ + private String className; + + /** The String values to pass to the custom class constructor. The plugin + * will add the document url as the first argument in className's + * String[] args. */ + private String[] userConstrArgs; + + /** The array where the plugin copies the url the userConstrArgs + * to create the instance of className. */ + private String[] constrArgs; + + /** The name of the method in the custom class to call. Its return value + * will become the value of fieldName in the NutchDocument. */ + private String methodName; + + /** The String values of the arguments to methodName. It's up to the + * developer of className to do any casts/conversions from String to + * another class in the code of className. */ + private String[] methodArgs; + + /** The result that returns from methodName. The plugin will set the value + * of fieldName to this. */ + private Object result; + + /** Optional flag to determine whether to overwrite the existing value in the + * NutchDocument fieldName if this is set to true. Default behavior is to + * add the value from calling methodName to existing values for fieldName. */ + private boolean overwrite = false; + + /** Hadoop Configuration object to pass these values into the plugin. */ + private Configuration conf; + + /** + * The {@link ArbitraryIndexingFilter} filter object uses reflection + * to instantiate the configured class and invoke the configured method. 
+ * It requires a few configuration settings for adding arbitrary fields + * and values to the NutchDocument as searchable fields. + * See {@code index.arbitrary.function.count}, and (possibly multiple + * instances when {@code index.arbitrary.function.count} > 1) of the following + * {@code index.arbitrary.fieldName}.index, + * {@code index.arbitrary.className}.index, + * {@code index.arbitrary.constructorArgs}.index, + * {@code index.arbitrary.methodName}.index, and + * {@code index.arbitrary.methodArgs}.index + * in nutch-default.xml or nutch-site.xml where index ranges from 0 + * to {@code index.arbitrary.function.count} - 1. + * + * @param doc + * The {@link NutchDocument} object + * @param parse + * The relevant {@link Parse} object passing through the filter + * @param url + * URL to be filtered by the user-specified class + * @param datum + * The {@link
Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]
lewismc commented on code in PR #810: URL: https://github.com/apache/nutch/pull/810#discussion_r1539390873 ## src/plugin/index-arbitrary/ivy.xml: ## @@ -0,0 +1,41 @@ + + + + Review Comment: Please remove whitespace. ## src/plugin/index-arbitrary/src/test/org/apache/nutch/indexer/arbitrary/Multiplier.java: ## @@ -0,0 +1,31 @@ +package org.apache.nutch.indexer.arbitrary; Review Comment: Please add ALv2 license header. ## build.xml: ## @@ -44,7 +44,7 @@ - + Review Comment: Please sync with `master` branch. This regression is tangential to NUTCH-3032. Thanks ## src/plugin/index-arbitrary/src/test/org/apache/nutch/indexer/arbitrary/Echo.java: ## @@ -0,0 +1,24 @@ +package org.apache.nutch.indexer.arbitrary; Review Comment: Please add ALv2 license header. ## src/plugin/index-arbitrary/ivy.xml: ## @@ -0,0 +1,41 @@ + + Review Comment: Please remove whitespace. ## src/plugin/index-arbitrary/build.xml: ## @@ -0,0 +1,6 @@ + Review Comment: Please add ALv2 license header. ## src/plugin/index-arbitrary/src/java/org/apache/nutch/indexer/arbitrary/ArbitraryIndexingFilter.java: ## @@ -0,0 +1,266 @@ +package org.apache.nutch.indexer.arbitrary; Review Comment: Please add ALv2 license header.
[PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]
CatChullain opened a new pull request, #810: URL: https://github.com/apache/nutch/pull/810 This is the initial code for an arbitrary indexing filter, NUTCH-3032. It could be helpful to let end users manipulate information at indexing time with their own code without the need for writing their own indexing plugin. I mentioned this on the dev mailing list (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some description of my work in progress. One potential use is to address some of the same concerns that NUTCH-585 discusses regarding an alternative approach to picking and choosing which content to index, but this approach would allow making index time decisions, rather than setting the configuration for all content at the start of the indexing run. Ideally a solution to NUTCH-585 would work at parse time, but this index time code still has potential uses for any kind of look-ups or calculations that depend on values in the document where the need to manipulate data exceeds what something like Jexl filter can do easily, or where outside data is worth incorporating into the document for use after indexing. Thanks for your contribution to [Apache Nutch](https://nutch.apache.org/)! Your help is appreciated! Before opening the pull request, please verify that * there is an open issue on the [Nutch issue tracker](https://issues.apache.org/jira/projects/NUTCH) which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes. 
* the issue ID (`NUTCH-`) - is referenced in the title of the pull request - and placed in front of your commit messages surrounded by square brackets (`[NUTCH-] Issue or pull request title`) * commits are squashed into a single one (or few commits for larger changes) * Java source code follows [Nutch Eclipse Code Formatting rules](https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml) * Nutch is successfully built and unit tests pass by running `ant clean runtime test` * there should be no conflicts when merging the pull request branch into the *recent* master branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled master branch. * if new dependencies are added, - are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](https://www.apache.org/legal/resolved.html#category-a)? - are `LICENSE-binary` and `NOTICE-binary` updated accordingly? We will be able to faster integrate your pull request if these conditions are met. If you have any questions how to fix your problem or about using Nutch in general, please sign up for the [Nutch mailing list](https://nutch.apache.org/mailing_lists.html). Thanks!
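The idea described in this PR, resolving a field value at index time by calling user POJO code, can be illustrated with a small reflection sketch. Everything named here is an assumption made for illustration: `KeywordCounter` is a made-up user class, and `ReflectiveFieldResolver.resolve` is a hypothetical helper that takes a class name, constructor arguments, a method name and method arguments (all strings). It is not the plugin's actual code and may differ from it in every detail.

```java
import java.lang.reflect.Method;
import java.util.Arrays;

/** Hypothetical user POJO: counts occurrences of a keyword in a text. */
class KeywordCounter {
  private final String keyword;

  public KeywordCounter(String keyword) {
    if (keyword == null || keyword.isEmpty()) {
      throw new IllegalArgumentException("keyword must not be empty");
    }
    this.keyword = keyword;
  }

  /** Returns the number of non-overlapping occurrences, as a string. */
  public String count(String text) {
    int n = 0;
    int from = 0;
    while ((from = text.indexOf(keyword, from)) >= 0) {
      n++;
      from += keyword.length();
    }
    return Integer.toString(n);
  }
}

/** Hypothetical helper (not part of Nutch): builds the POJO and calls one method. */
class ReflectiveFieldResolver {

  /** Assumes every constructor and method parameter is a String. */
  static Object resolve(String className, String[] constructorArgs,
      String methodName, String[] methodArgs) {
    try {
      Class<?> clazz = Class.forName(className);
      Object target = clazz.getConstructor(stringTypes(constructorArgs.length))
          .newInstance((Object[]) constructorArgs);
      Method method = clazz.getMethod(methodName, stringTypes(methodArgs.length));
      return method.invoke(target, (Object[]) methodArgs);
    } catch (ReflectiveOperationException e) {
      // Wrap checked reflection failures so callers see a single unchecked error.
      throw new IllegalStateException(
          "Could not resolve value via " + className + "#" + methodName, e);
    }
  }

  private static Class<?>[] stringTypes(int n) {
    Class<?>[] types = new Class<?>[n];
    Arrays.fill(types, String.class);
    return types;
  }
}
```

In such a design the returned object would be what gets added to the document under the configured field name; restricting all arguments to strings keeps the configuration simple at the cost of pushing any type conversion into the user class.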
Re: [PR] NUTCH-3035 Update license and notice file for release of 1.20 [nutch]
sebastian-nagel commented on PR #808: URL: https://github.com/apache/nutch/pull/808#issuecomment-2000233258 Hi Lewis, it's done in three steps:
1. run `ant report-licenses` (Rat task) for core and all plugins
2. process all reports: list all combinations of , try to extract the organization and project description from the ivy cache, normalize license names, etc.
3. manually verify the output of steps 1 and 2 and merge it with the existing license and notice files
Steps 1 and 2 are done by the Jupyter notebook attached to NUTCH-2290. Because the output (NOTICE-binary and LICENSE-binary) is somewhat noisy, manual verification is necessary. See also NUTCH-2981: Storm has some scripts to automatically generate the license reports.
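The "normalize license names" part of step 2 can be pictured with a short sketch. This is not the Jupyter notebook attached to NUTCH-2290; the class `LicenseNormalizerSketch`, its matching rules and the canonical names it returns are assumptions chosen only to show the idea of collapsing spelling variants before grouping dependencies by license.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: collapse license-name variants and group dependencies. */
class LicenseNormalizerSketch {

  /** Maps a raw license name to a canonical label; unknown names pass through trimmed. */
  static String normalize(String raw) {
    // Lower-case and replace punctuation runs with spaces so variants compare equal.
    String s = raw.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9.]+", " ").trim();
    if (s.contains("apache") && s.contains("2.0")) {
      return "Apache-2.0";
    }
    if (s.equals("mit") || s.contains("mit license")) {
      return "MIT";
    }
    return raw.trim();
  }

  /** Groups dependency identifiers by their normalized license, sorted by license. */
  static Map<String, List<String>> groupByLicense(Map<String, String> licenseByDependency) {
    Map<String, List<String>> grouped = new TreeMap<>();
    for (Map.Entry<String, String> e : licenseByDependency.entrySet()) {
      grouped.computeIfAbsent(normalize(e.getValue()), k -> new ArrayList<>())
          .add(e.getKey());
    }
    return grouped;
  }
}
```

Rules this coarse will misclassify some names, which is consistent with the comment that the generated output is noisy and still needs manual verification.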
Re: [PR] NUTCH-3026 -- add statusOnly as an indexing option [nutch]
tballison closed pull request #799: NUTCH-3026 -- add statusOnly as an indexing option URL: https://github.com/apache/nutch/pull/799
Re: [PR] WIP StatsD metrics example [nutch]
lewismc commented on PR #712: URL: https://github.com/apache/nutch/pull/712#issuecomment-1998875276 Closing this PR out. StatsD is widely used but open source Java SDKs/agents are few and far between. When I get around to properly instrumenting Nutch I will probably suggest that we use [Apache SkyWalking](https://skywalking.apache.org/).
Re: [PR] WIP StatsD metrics example [nutch]
lewismc closed pull request #712: WIP StatsD metrics example URL: https://github.com/apache/nutch/pull/712
Re: [PR] NUTCH-3036 Upgrade org.seleniumhq.selenium:selenium-java dependency i… [nutch]
lewismc closed pull request #807: NUTCH-3036 Upgrade org.seleniumhq.selenium:selenium-java dependency i… URL: https://github.com/apache/nutch/pull/807
Re: [PR] NUTCH-3036 Upgrade org.seleniumhq.selenium:selenium-java dependency i… [nutch]
lewismc commented on PR #807: URL: https://github.com/apache/nutch/pull/807#issuecomment-1998718730 There are some tangential proposed changes (such as improvements to logging) to this PR but they concern the relevant Class files.
Re: [PR] NUTCH-3035 Update license and notice file for release of 1.20 [nutch]
lewismc commented on PR #808: URL: https://github.com/apache/nutch/pull/808#issuecomment-1998717443 Hi @sebastian-nagel did you perform this task manually?
Re: [PR] NUTCH-3035 Update license and notice file for release of 1.20 [nutch]
lewismc closed pull request #808: NUTCH-3035 Update license and notice file for release of 1.20 URL: https://github.com/apache/nutch/pull/808
Re: [PR] NUTCH-3036 Upgrade org.seleniumhq.selenium:selenium-java dependency i… [nutch]
lewismc commented on PR #807: URL: https://github.com/apache/nutch/pull/807#issuecomment-1998714969 [Further guidance on browser compatibility/supported platforms](https://firefox-source-docs.mozilla.org/testing/geckodriver/Support.html) Along the way I discovered that **_full screenshots_** are now handled differently so we need to rethink how to do this. For example, the [FirefoxDriver has a pretty elegant way of doing this](https://www.selenium.dev/selenium/docs/api/java/org/openqa/selenium/firefox/FirefoxDriver.html#getFullPageScreenshotAs(org.openqa.selenium.OutputType)) but it is different on other browsers. For the time being each browser can take a screenshot of the view window/partial webpage. This is satisfactory but there is room for improvement.
Re: [PR] NUTCH-3036 Upgrade org.seleniumhq.selenium:selenium-java dependency i… [nutch]
lewismc commented on PR #807: URL: https://github.com/apache/nutch/pull/807#issuecomment-1998711992 PR ready for review. Tested on * MacBook Pro * Apple M1 Pro * Sonoma 14.4 * Firefox 115.X (compatible with current version of Selenium)
[PR] NUTCH-3035 Update license and notice file for release of 1.20 [nutch]
sebastian-nagel opened a new pull request, #808: URL: https://github.com/apache/nutch/pull/808 Update the license and notice files of dependencies included as binary jar files in the binary release.
Re: [PR] NUTCH-3008 indexer-elastic: downgrade to ES 7.10.2 to address licensing issues [nutch]
sebastian-nagel merged PR #806: URL: https://github.com/apache/nutch/pull/806
Re: [PR] NUTCH-3008 indexer-elastic: downgrade to ES 7.10.2 to address licensing issues [nutch]
lewismc commented on PR #806: URL: https://github.com/apache/nutch/pull/806#issuecomment-1995922015 Tested with ES 7.10.2 6 node cluster. +1 LGTM.
[PR] NUTCH-3008 indexer-elastic: downgrade to ES 7.10.2 to address licensing issues [nutch]
sebastian-nagel opened a new pull request, #806: URL: https://github.com/apache/nutch/pull/806 This PR downgrades the ES client to version 7.10.2 which is licensed under ASF 2.0 - it's a quick fix to stay compatible with ASF policies. Not yet tested: indexing into ES To be done: update the LICENSE and NOTICE files. I'll do this as part of a separate issue NUTCH-3035.
Re: [PR] NUTCH-3033 Upgrade Ivy to v2.5.2 [nutch]
lewismc commented on PR #803: URL: https://github.com/apache/nutch/pull/803#issuecomment-1994562354 Thanks @sebastian-nagel
Re: [PR] NUTCH-3033 Upgrade Ivy to v2.5.2 [nutch]
lewismc merged PR #803: URL: https://github.com/apache/nutch/pull/803
Re: [PR] Update Dockerfile / JAVA_HOME - 2nd try [nutch]
lewismc commented on PR #805: URL: https://github.com/apache/nutch/pull/805#issuecomment-1993567784 Thanks @derhecht
Re: [PR] Update Dockerfile / JAVA_HOME - 2nd try [nutch]
lewismc merged PR #805: URL: https://github.com/apache/nutch/pull/805
Re: [PR] NUTCH-3033 Upgrade Ivy to v2.5.2 [nutch]
lewismc commented on PR #803: URL: https://github.com/apache/nutch/pull/803#issuecomment-1993545146 After lots of trial and error I think I cracked this one. Ultimately there were several places where the optional `(-[classifier])` element has to be added to the `ivy:retrieve pattern`. This wasn’t particularly intuitive as the ivy documentation is [somewhat lacking in this regard](https://ant.apache.org/ivy/history/2.5.2/resolver/filesystem.html#_child_elements) however @bodewig [pointed me in the right direction on the ivy-user@ mailing list](https://lists.apache.org/thread/fdd9r5gkdk5215hc9swcxhjwyvnzoz0w). Thank you for that @bodewig. This PR is ready for review.
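Why the optional `(-[classifier])` element matters can be shown with a toy model of pattern expansion. The class below is a simplified, hypothetical re-implementation written for illustration, not Ivy's code: it substitutes `[token]` placeholders and keeps a parenthesised optional part only when every token inside it has a value. The two classifier values used in the usage note are illustrative assumptions for the `netty-transport-native-kqueue` module.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Simplified, hypothetical model of retrieve-pattern expansion; not Ivy's implementation. */
class RetrievePatternSketch {
  private static final Pattern OPTIONAL = Pattern.compile("\\(([^()]*)\\)");
  private static final Pattern TOKEN = Pattern.compile("\\[([a-z]+)\\]");

  /** Builds the attribute map for one jar artifact; classifier may be null. */
  static Map<String, String> artifact(String name, String revision, String classifier) {
    Map<String, String> attrs = new HashMap<>();
    attrs.put("artifact", name);
    attrs.put("revision", revision);
    attrs.put("ext", "jar");
    if (classifier != null) {
      attrs.put("classifier", classifier);
    }
    return attrs;
  }

  /** Expands a pattern such as "[artifact]-[revision](-[classifier]).[ext]". */
  static String expand(String pattern, Map<String, String> attrs) {
    // Pass 1: keep an optional "(...)" part only if all of its tokens have values.
    Matcher opt = OPTIONAL.matcher(pattern);
    StringBuffer kept = new StringBuffer();
    while (opt.find()) {
      String inner = opt.group(1);
      boolean allPresent = true;
      Matcher t = TOKEN.matcher(inner);
      while (t.find()) {
        allPresent &= attrs.containsKey(t.group(1));
      }
      opt.appendReplacement(kept, Matcher.quoteReplacement(allPresent ? inner : ""));
    }
    opt.appendTail(kept);
    // Pass 2: substitute the remaining tokens with their values.
    Matcher tok = TOKEN.matcher(kept.toString());
    StringBuffer out = new StringBuffer();
    while (tok.find()) {
      String value = attrs.getOrDefault(tok.group(1), "");
      tok.appendReplacement(out, Matcher.quoteReplacement(value));
    }
    tok.appendTail(out);
    return out.toString();
  }

  /** Returns the file names that more than one artifact would be written to. */
  static Set<String> collisions(String pattern, List<Map<String, String>> artifacts) {
    Set<String> seen = new HashSet<>();
    Set<String> duplicates = new HashSet<>();
    for (Map<String, String> a : artifacts) {
      String file = expand(pattern, a);
      if (!seen.add(file)) {
        duplicates.add(file);
      }
    }
    return duplicates;
  }
}
```

With two artifacts of the same module and revision that differ only in classifier (say `osx-x86_64` and `osx-aarch_64`), `collisions("[artifact]-[revision].[ext]", ...)` reports one shared file name, while the pattern with `(-[classifier])` yields two distinct names; artifacts without a classifier keep their plain name because the optional part is dropped.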
Re: [PR] NUTCH-3033 Upgrade Ivy to v2.5.2 [nutch]
lewismc closed pull request #803: NUTCH-3033 Upgrade Ivy to v2.5.2 URL: https://github.com/apache/nutch/pull/803
Re: [PR] Update Dockerfile / JAVA_HOME [nutch]
derhecht commented on PR #801: URL: https://github.com/apache/nutch/pull/801#issuecomment-1991968018 see #805
[PR] Update Dockerfile / JAVA_HOME - 2nd try [nutch]
derhecht opened a new pull request, #805: URL: https://github.com/apache/nutch/pull/805 Alpine uses the ash shell by default, which results in an unset JAVA_HOME environment variable. Sorry, there is no issue reported at the moment on issues.apache.org; nevertheless, it is one I'm facing. See #801
Re: [PR] NUTCH-3033 Upgrade Ivy to v2.5.2 [nutch]
lewismc commented on PR #803: URL: https://github.com/apache/nutch/pull/803#issuecomment-1989446036 Hmmm, I upgraded to 2.5.1 and the CI runs just fine. Looks like there is some regression/additional configuration required with 2.5.2. I’m asking the question over on ivy-user@ mailing list.
Re: [PR] NUTCH-3026 -- add statusOnly as an indexing option [nutch]
lewismc commented on PR #799: URL: https://github.com/apache/nutch/pull/799#issuecomment-1989404991 Hmmm. It appears that there are problems with the `protocol-http` unit tests…
```
[echo] Testing plugin: protocol-http
[junit] Running org.apache.nutch.protocol.http.TestBadServerResponses
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.846 sec
[junit] Running org.apache.nutch.parse.tika.TestHtmlParser
[junit] Tests run: 9, Failures: 4, Errors: 0, Skipped: 0, Time elapsed: 3.659 sec
[junit] Test org.apache.nutch.protocol.http.TestBadServerResponses FAILED
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.599 sec
[junit] Running org.apache.nutch.protocol.http.TestProtocolHttp
[junit] Running org.apache.nutch.parse.tika.TestImageMetadata
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.074 sec
[junit] Running org.apache.nutch.protocol.http.TestResponse
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 1.699 sec
```
Re: [PR] NUTCH-3026 -- add statusOnly as an indexing option [nutch]
lewismc closed pull request #799: NUTCH-3026 -- add statusOnly as an indexing option URL: https://github.com/apache/nutch/pull/799
Re: [PR] Update Dockerfile / JAVA_HOME [nutch]
lewismc commented on PR #801: URL: https://github.com/apache/nutch/pull/801#issuecomment-1989379558 @derhecht apologies I merged this mistakenly. Can you please submit the PR against master branch? Thank you
Re: [PR] NUTCH-3026 -- add statusOnly as an indexing option [nutch]
lewismc commented on PR #799: URL: https://github.com/apache/nutch/pull/799#issuecomment-1989380993 Reopening to have CI run again.
Re: [PR] Revert "Update Dockerfile / JAVA_HOME" [nutch]
lewismc merged PR #804: URL: https://github.com/apache/nutch/pull/804
[PR] Revert "Update Dockerfile / JAVA_HOME" [nutch]
lewismc opened a new pull request, #804: URL: https://github.com/apache/nutch/pull/804 Reverts apache/nutch#801
Re: [PR] Update Dockerfile / JAVA_HOME [nutch]
lewismc merged PR #801: URL: https://github.com/apache/nutch/pull/801
Re: [PR] NUTCH-3033 Upgrade Ivy to v2.5.2 [nutch]
lewismc commented on PR #803: URL: https://github.com/apache/nutch/pull/803#issuecomment-1989365064 OK so it looks like the [newer Ivy version is being used just fine](https://github.com/apache/nutch/actions/runs/8239165168/job/22531780061?pr=803#step:4:78). The build did however fail with the following
```
Caused by: java.lang.RuntimeException: Multiple artifacts of the module io.netty#netty-transport-native-kqueue;4.1.84.Final are retrieved to the same file! Update the retrieve pattern to fix this error.
```
… investigating.
[PR] NUTCH-3033 Upgrade Ivy to v2.5.2 [nutch]
lewismc opened a new pull request, #803: URL: https://github.com/apache/nutch/pull/803 PR for https://issues.apache.org/jira/browse/NUTCH-3033 I was having trouble locally resolving the Ivy version to 2.5.2… I can’t yet figure out why 2.5.1 was being used. I’ll check out the CI log and see if the newer version is used.
Re: [PR] [NUTCH-2834] Update crawl documentation / Fix #557 [nutch]
sebastian-nagel commented on PR #800: URL: https://github.com/apache/nutch/pull/800#issuecomment-1987171023 Thanks, @derhecht! Good catch!
Re: [PR] [NUTCH-2834] Update crawl documentation / Fix #557 [nutch]
sebastian-nagel merged PR #800: URL: https://github.com/apache/nutch/pull/800
Re: [PR] fix for NUTCH-3027 contributed by skehrli [nutch]
sebastian-nagel closed pull request #802: fix for NUTCH-3027 contributed by skehrli URL: https://github.com/apache/nutch/pull/802
Re: [PR] fix for NUTCH-3027 contributed by skehrli [nutch]
sebastian-nagel commented on PR #802: URL: https://github.com/apache/nutch/pull/802#issuecomment-1987165751 Patch applied to master in d95e1a7, see comments on Jira in NUTCH-3027. Thanks again @skehrli !
[PR] fix for NUTCH-3027 contributed by skehrli [nutch]
skehrli opened a new pull request, #802: URL: https://github.com/apache/nutch/pull/802
Re: [PR] NUTCH-1541 Indexer plugin to write CSV [nutch]
lewismc commented on PR #294: URL: https://github.com/apache/nutch/pull/294#issuecomment-1877232892 Hi @grege117 I’ll try to have a crack at this _soon_. Thanks for the heads up. If you feel like forking the branch and having a go at the fix, then please do. I will try to shepherd in your contribution if you can make one!
Re: [PR] NUTCH-1541 Indexer plugin to write CSV [nutch]
grege117 commented on PR #294: URL: https://github.com/apache/nutch/pull/294#issuecomment-1868000580 Sorry to chime in a few years late, but I'm not sure this plugin is configured correctly. If I modify my conf/index-writers.xml and remove everything except for ", you will get the message: IndexerOutputFormat [pool-5-thread-1] No IndexWriters activated - check your configuration The only way I could write to CSV was to execute what @sebastian-nagel wrote above: bin/nutch index -Dplugin.includes='indexer-csv' crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20231222132024/ -filter -normalize -deleteGone However, if I add back in the index-writer for SOLR, that just works (no -Dplugin.includes is required). So I think there's a bug here in the OOTB configuration that prevents indexer-csv working without specifying it on the CLI.
[PR] [NUTCH-2834] Update crawl documentation / Fix #557 [nutch]
derhecht opened a new pull request, #800: URL: https://github.com/apache/nutch/pull/800 Show --dedup-group instead of -dedup-group, which has led to misleading output
Re: [PR] NUTCH-3024 Remove flaky 'dependency check' target [nutch]
lewismc merged PR #795: URL: https://github.com/apache/nutch/pull/795
[PR] NUTCH-3026 -- add statusOnly as an indexing option [nutch]
tballison opened a new pull request, #799: URL: https://github.com/apache/nutch/pull/799
[PR] fix for NUTCH-2812 contributed by GabeHaegele [nutch]
GabeHaegele opened a new pull request, #798: URL: https://github.com/apache/nutch/pull/798
Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]
sebastian-nagel commented on PR #796: URL: https://github.com/apache/nutch/pull/796#issuecomment-1802531264 Thanks, @jnioche!
Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]
sebastian-nagel merged PR #796: URL: https://github.com/apache/nutch/pull/796
Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]
jnioche commented on PR #796: URL: https://github.com/apache/nutch/pull/796#issuecomment-1801938355 @sebastian-nagel I merged the changes from master and made a few improvements.
Re: [PR] [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 [nutch]
sebastian-nagel commented on PR #793: URL: https://github.com/apache/nutch/pull/793#issuecomment-1801814549 Thanks, @jnioche! Merged into master, adding the lines to make use of Hadoop-provided compression codecs. Successfully tested in local and pseudo-distributed mode with various codecs (gzip / .gz, bzip2, ZStandard / .zst). One final note: if the fast-urlfilter is not found, the Nutch job (local mode) or the tasks (distributed mode) fail with an exception. I didn't change this behavior.
Re: [PR] [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 [nutch]
sebastian-nagel closed pull request #793: [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 URL: https://github.com/apache/nutch/pull/793
Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]
jnioche commented on PR #796: URL: https://github.com/apache/nutch/pull/796#issuecomment-1798221743 Writing a test for this thing is an absolute pain. The way the filters are used for real is that their method setConf is called and the rules are loaded using _getConfResourceAsReader_, i.e. they are expected to be in the jar. The tests do not rely on that mechanism and instead instantiate the filter with the reader for its rules. This means that the conf is not used at all, and therefore we can't use it to load the values for the length-based filters. I will add another constructor with the reader + conf so that we can test based on the length.
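The constructor idea in the comment above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual `FastURLFilter` code: the class `SketchUrlFilter`, the property name `sketch.url.max.length`, and the one-substring-per-line rule format are all hypothetical, and `java.util.Properties` stands in for Hadoop's `Configuration` so that the example has no external dependencies.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical stand-in for a rule-based URL filter; not the real FastURLFilter. */
final class SketchUrlFilter {
  // Hypothetical property name, used only in this sketch.
  static final String MAX_LENGTH = "sketch.url.max.length";

  private final List<String> deniedSubstrings = new ArrayList<>();
  private final int maxLength; // a negative value disables the length check

  /** Test-style constructor: rules only, so no length limit can be configured. */
  SketchUrlFilter(Reader rules) {
    this(rules, new Properties());
  }

  /** Additional constructor: rules plus configuration, so a test can set the limit. */
  SketchUrlFilter(Reader rules, Properties conf) {
    this.maxLength = Integer.parseInt(conf.getProperty(MAX_LENGTH, "-1"));
    try (BufferedReader in = new BufferedReader(rules)) {
      String line;
      while ((line = in.readLine()) != null) {
        String rule = line.trim();
        // Assumed rule format: one denied substring per line, '#' starts a comment.
        if (!rule.isEmpty() && !rule.startsWith("#")) {
          deniedSubstrings.add(rule);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Returns the URL if it passes all checks, or null if it is filtered out. */
  String filter(String url) {
    if (maxLength >= 0 && url.length() > maxLength) {
      return null;
    }
    for (String denied : deniedSubstrings) {
      if (url.contains(denied)) {
        return null;
      }
    }
    return url;
  }
}
```

With the second constructor a unit test can pass a `StringReader` for the rules and a small configuration object for the limit, without relying on resources packaged in the job jar.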
Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]
jnioche commented on code in PR #796: URL: https://github.com/apache/nutch/pull/796#discussion_r1384621727
## src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java: ##
@@ -97,9 +97,17 @@ public class FastURLFilter implements URLFilter {
   private Configuration conf;
   public static final String URLFILTER_FAST_FILE = "urlfilter.fast.file";
+  public static final String URLFILTER_FAST_PATH_MAX_LENGTH = "urlfilter.fast.url.path.max.length";
+  public static final String URLFILTER_FAST_QUERY_MAX_LENGTH = "urlfilter.fast.url.query.max.length";
+
Review Comment: I might keep things simple and just add a size limit on the whole URL regardless of its parts, similar to [what is done in StormCrawler.](https://github.com/DigitalPebble/storm-crawler/blob/ef31e509139cccb2919c345ef343c4fcfb2f1ec5/core/src/main/java/com/digitalpebble/stormcrawler/filtering/basic/BasicURLFilter.java#L30C17-L30C26)
Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]
sebastian-nagel commented on code in PR #796: URL: https://github.com/apache/nutch/pull/796#discussion_r1384536930
## src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java: ##
@@ -97,9 +97,17 @@ public class FastURLFilter implements URLFilter {
   private Configuration conf;
   public static final String URLFILTER_FAST_FILE = "urlfilter.fast.file";
+  public static final String URLFILTER_FAST_PATH_MAX_LENGTH = "urlfilter.fast.url.path.max.length";
+  public static final String URLFILTER_FAST_QUERY_MAX_LENGTH = "urlfilter.fast.url.query.max.length";
+
Review Comment: What about adding a third limit for path and query combined ([URL.getFile()](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/net/URL.html#getFile()))?
- if somebody defined two generous but reasonable limits (for example, 2048) for both path and query, the resulting URL may still get quite long and cause trouble
- also the HTTP GET request includes both path and query
- for many use cases it should be sufficient to just set this limit
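The three limits discussed in this review comment (path, query, and path plus query combined) could look roughly like the sketch below. It is only an illustration under stated assumptions: the class `UrlLengthLimits` and its constructor parameters are hypothetical and are not taken from the pull request. It uses `java.net.URL`, whose `getFile()` returns the path followed by `?` and the query when a query is present.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical helper illustrating separate and combined length limits. */
final class UrlLengthLimits {
  // A negative limit disables the corresponding check.
  private final int maxPathLength;
  private final int maxQueryLength;
  private final int maxFileLength; // path + "?" + query, as returned by URL.getFile()

  UrlLengthLimits(int maxPathLength, int maxQueryLength, int maxFileLength) {
    this.maxPathLength = maxPathLength;
    this.maxQueryLength = maxQueryLength;
    this.maxFileLength = maxFileLength;
  }

  /** Returns true if the URL parses and stays within all enabled limits. */
  boolean accept(String urlString) {
    URL url;
    try {
      // The String constructor is deprecated since Java 20 but still available.
      url = new URL(urlString);
    } catch (MalformedURLException e) {
      return false;
    }
    String query = url.getQuery(); // null when the URL has no query part
    if (exceeds(url.getPath().length(), maxPathLength)) {
      return false;
    }
    if (query != null && exceeds(query.length(), maxQueryLength)) {
      return false;
    }
    return !exceeds(url.getFile().length(), maxFileLength);
  }

  private static boolean exceeds(int length, int limit) {
    return limit >= 0 && length > limit;
  }
}
```

For example, with limits of 2048 for path and query and 3000 for the combination, a URL with a 2000-character path and a 2000-character query passes the first two checks but is rejected by the third, which is the case the comment is concerned about.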
Re: [PR] NUTCH-3020 -- ParseSegment should check for okhttp's truncation flag [nutch]
tballison merged PR #794: URL: https://github.com/apache/nutch/pull/794
Re: [PR] NUTCH-3019 -- update Tika to 2.9.1 [nutch]
tballison merged PR #797: URL: https://github.com/apache/nutch/pull/797
Re: [PR] NUTCH-3019 -- update Tika to 2.9.1 [nutch]
tballison commented on PR #797: URL: https://github.com/apache/nutch/pull/797#issuecomment-1795161171
```
2023-11-06T15:02:47.9408964Z [junit] Tests run: 14, Failures: 2, Errors: 0, Skipped: 4, Time elapsed: 4.342 sec
2023-11-06T15:02:48.2192793Z [junit] Test org.apache.nutch.protocol.okhttp.TestBadServerResponses FAILED
```
Re: [PR] NUTCH-3019 -- update Tika to 2.9.1 [nutch]
tballison commented on PR #797: URL: https://github.com/apache/nutch/pull/797#issuecomment-1794934171 Need to keep this as a draft until the 2.9.1.0 shim actually lands in Maven Central.
[PR] NUTCH-3019 -- update Tika to 2.9.1 [nutch]
tballison opened a new pull request, #797: URL: https://github.com/apache/nutch/pull/797
[PR] NUTCH-3024 Remove flaky 'dependency check' target [nutch]
lewismc opened a new pull request, #795: URL: https://github.com/apache/nutch/pull/795 Addresses https://issues.apache.org/jira/browse/NUTCH-3024
Re: [PR] NUTCH-3014 Standardize Job names [nutch]
lewismc merged PR #789: URL: https://github.com/apache/nutch/pull/789
Re: [PR] NUTCH-3014 Standardize Job names [nutch]
lewismc commented on code in PR #789: URL: https://github.com/apache/nutch/pull/789#discussion_r138646
## src/java/org/apache/nutch/crawl/CrawlDbReader.java: ##
@@ -812,7 +811,7 @@ public CrawlDatum get(String crawlDb, String url, Configuration config)
   @Override
   protected int process(String line, StringBuilder output) throws Exception {
-    Job job = NutchJob.getInstance(getConf());
+    Job job = Job.getInstance(getConf(), "Nutch CrawlDbReader: process " + this.crawlDb);
Review Comment: Thanks @sebastian-nagel
Re: [PR] NUTCH-3020 -- ParseSegment should check for okhttp's truncation flag [nutch]
lewismc commented on PR #794: URL: https://github.com/apache/nutch/pull/794#issuecomment-1789810071 We have no tests for `ParseSegment` right now. I think it would be excellent if this PR could include a test for `ParseSegment.isTruncated`.
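To make the request for a test more concrete, here is a small, self-contained sketch of what a truncation check and its test cases might cover. It is not the `ParseSegment` code: the class `TruncationCheck`, the flag key `sketch.truncated`, and the use of a plain `Map` for metadata are assumptions made for illustration, and the real method signature and metadata keys in Nutch may differ. The sketch assumes truncation is reported either by a flag that the protocol plugin stores in the metadata or by a `Content-Length` header larger than the number of bytes actually fetched.

```java
import java.util.Map;

/** Hypothetical truncation check; a stand-in, not ParseSegment.isTruncated. */
final class TruncationCheck {
  // Hypothetical metadata key for a flag set by the protocol plugin.
  static final String TRUNCATED_FLAG = "sketch.truncated";
  static final String CONTENT_LENGTH = "Content-Length";

  static boolean isTruncated(byte[] content, Map<String, String> metadata) {
    // 1. Trust an explicit flag if the fetcher recorded one (parseBoolean(null) is false).
    if (Boolean.parseBoolean(metadata.get(TRUNCATED_FLAG))) {
      return true;
    }
    // 2. Otherwise compare the declared length with the bytes actually stored.
    String declared = metadata.get(CONTENT_LENGTH);
    if (declared == null) {
      return false; // nothing to compare against
    }
    try {
      return content.length < Long.parseLong(declared.trim());
    } catch (NumberFormatException e) {
      return false; // unparsable header: treated as not truncated in this sketch
    }
  }
}
```

A test along these lines would cover at least four cases: flag set, declared length larger than the content, declared length equal to the content, and a missing or malformed header.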
[PR] NUTCH-3020 -- ParseSegment should check for okhttp's truncation flag [nutch]
tballison opened a new pull request, #794: URL: https://github.com/apache/nutch/pull/794
Re: [PR] [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 [nutch]
sebastian-nagel commented on code in PR #793: URL: https://github.com/apache/nutch/pull/793#discussion_r1377375552
## src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java: ##
@@ -181,9 +186,23 @@ public String filter(String url) {
   public void reloadRules() throws IOException {
     String fileRules = conf.get(URLFILTER_FAST_FILE);
-    try (Reader reader = conf.getConfResourceAsReader(fileRules)) {
-      reloadRules(reader);
+
+    InputStream is;
+
+    Path fileRulesPath = new Path(fileRules);
+    if (fileRulesPath.toUri().getScheme() != null) {
+      FileSystem fs = fileRulesPath.getFileSystem(conf);
+      is = fs.open(fileRulesPath);
+    }
Review Comment: Since we have Hadoop, we could try all supported compression codecs (gzip, bzip2, zstd, etc.). Something such as (not tested):
```java
CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(fileRulesPath);
if (codec != null) {
  is = codec.createInputStream(is);
}
```
See [cf.getCodec(...)](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/io/compress/CompressionCodecFactory.html#getCodec-org.apache.hadoop.fs.Path-) and [codec.createInputStream(...)](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/io/compress/CompressionCodec.html#createInputStream-java.io.InputStream-). If the rules file is contained in the job jar, it shouldn't be compressed anyway.
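The suggestion above selects a decompressor from the file name and wraps the input stream only when a codec matches. The same pattern can be shown without Hadoop on the classpath. The sketch below is a simplified stand-in and not the code merged into Nutch: the class `RulesOpener` and its methods are hypothetical, and it handles only the `.gz` suffix through `java.util.zip.GZIPInputStream`, whereas `CompressionCodecFactory` covers every codec configured in Hadoop.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper: picks a decompressor by file suffix (gzip only). */
final class RulesOpener {

  /** Wraps the raw stream in a gzip decoder if the name ends with ".gz". */
  static Reader open(String fileName, InputStream raw) {
    try {
      InputStream is = fileName.endsWith(".gz") ? new GZIPInputStream(raw) : raw;
      return new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Usage helper: reads the first line of an in-memory rules file. */
  static String firstLine(String fileName, byte[] data) {
    try (BufferedReader in = new BufferedReader(open(fileName, new ByteArrayInputStream(data)))) {
      return in.readLine();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Usage helper: gzip-compresses a byte array in memory. */
  static byte[] gzip(byte[] data) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
      out.write(data);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }
}
```

Keeping the uncompressed path as the default means that plain rules files, such as those packaged in the job jar, are read exactly as before.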