[PR] NUTCH-3058 Fetcher: counter for hung threads [nutch]

2024-06-05 Thread via GitHub


sebastian-nagel opened a new pull request, #820:
URL: https://github.com/apache/nutch/pull/820

   - count the number of hung threads in a fetcher job
   - log and count the number of fetch items still queued when the "hard" 
timeout is reached


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] NUTCH-3055 README: fix Github "hub" commands [nutch]

2024-05-28 Thread via GitHub


sebastian-nagel merged PR #818:
URL: https://github.com/apache/nutch/pull/818



Re: [PR] NUTCH-3044 Generator: NPE when extracting the host part of a URL fails [nutch]

2024-05-28 Thread via GitHub


sebastian-nagel merged PR #815:
URL: https://github.com/apache/nutch/pull/815



Re: [PR] NUTCH-3057 - Fix for index-arbitrary plugin improper retention and us… [nutch]

2024-05-17 Thread via GitHub


lewismc commented on PR #819:
URL: https://github.com/apache/nutch/pull/819#issuecomment-2118551238

   Thanks for reporting, @CatChullain. I didn't catch this edge case either when 
reviewing or testing. 
   Out of curiosity, what does your deployment look like? Local or deploy?



[PR] NUTCH-3057 - Fix for index-arbitrary plugin improper retention and us… [nutch]

2024-05-17 Thread via GitHub


CatChullain opened a new pull request, #819:
URL: https://github.com/apache/nutch/pull/819

   Fix for NUTCH-3057, where the index-arbitrary plugin retained the value of 
one field and erroneously set it on the next field declared in its config stanzas



Re: [PR] NUTCH-3041 Address confusing logging in o.a.n.net.URLExemptionFilters [nutch]

2024-05-15 Thread via GitHub


lewismc merged PR #813:
URL: https://github.com/apache/nutch/pull/813



Re: [PR] Revert incorrect change [nutch-site]

2024-05-15 Thread via GitHub


lewismc commented on PR #2:
URL: https://github.com/apache/nutch-site/pull/2#issuecomment-2112989006

   Yes thank you @sebbASF 



Re: [PR] NUTCH-3043 Generator: count URLs rejected by URL filters [nutch]

2024-05-14 Thread via GitHub


sebastian-nagel commented on PR #814:
URL: https://github.com/apache/nutch/pull/814#issuecomment-2110558876

   Thanks, @lewismc! The metrics wiki page was updated.



Re: [PR] NUTCH-3043 Generator: count URLs rejected by URL filters [nutch]

2024-05-14 Thread via GitHub


sebastian-nagel merged PR #814:
URL: https://github.com/apache/nutch/pull/814



Re: [PR] NUTCH-3039 Failure to handle ftp:// URLs [nutch]

2024-05-14 Thread via GitHub


sebastian-nagel merged PR #812:
URL: https://github.com/apache/nutch/pull/812



Re: [PR] Revert incorrect change [nutch-site]

2024-05-11 Thread via GitHub


sebastian-nagel commented on PR #2:
URL: https://github.com/apache/nutch-site/pull/2#issuecomment-2105982524

   Thanks, @sebbASF!



Re: [PR] Revert incorrect change [nutch-site]

2024-05-11 Thread via GitHub


sebastian-nagel merged PR #2:
URL: https://github.com/apache/nutch-site/pull/2



[PR] Revert incorrect change [nutch-site]

2024-05-07 Thread via GitHub


sebbASF opened a new pull request, #2:
URL: https://github.com/apache/nutch-site/pull/2

   Nutch is currently not listed under the web-framework category on 
projects.apache.org



Re: [PR] NUTCH-3054 Address deprecation of Node16 for all GitHub Actions [nutch]

2024-04-30 Thread via GitHub


lewismc merged PR #817:
URL: https://github.com/apache/nutch/pull/817



[PR] NUTCH-1806 Delegate processing of URL domains to crawler-common [nutch]

2024-04-29 Thread via GitHub


sebastian-nagel opened a new pull request, #816:
URL: https://github.com/apache/nutch/pull/816

   and NUTCH-1942 Remove TopLevelDomain
   
   - use methods from crawler-commons' EffectiveTldFinder in URLUtil, replacing classes and methods from the "org.apache.nutch.util.domain" package
   
   - adapt and extend unit tests
     - add tests for URLUtil.getTopLevelDomainName(url)
     - reflect changes to the public suffix list since 2014 ("xyz" is now a public suffix / ICANN suffix)
     - adapt to minor API changes
       - URLUtil.getDomainName(url) returns the host name in case no valid public suffix is found
       - for Unicode suffixes and TLDs, the methods URLUtil.getDomainSuffix(url) and URLUtil.getTopLevelDomainName(url), respectively, now return the ASCII representation
     - add unit tests for host names with trailing dot ("www.apache.org.")
     - add unit test for URLs without host/domain (cf. NUTCH-2450)
   
   - update and complete Javadoc
   
   - update DomainStatistics, TLDIndexingFilter and domain URL filters to use 
the updated methods in URLUtil
   - remove the class TLDScoringFilter. The configuration is bound to the 
domain-suffixes.xml which wasn't maintained anymore and is now removed
   - remove package org.apache.nutch.util.domain
   - move DomainStatistics to org.apache.nutch.util
   - remove configuration files of domain utils
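
   The fallback and normalization rules listed above can be illustrated with a small stand-alone sketch. It is not the crawler-commons or URLUtil implementation: `DomainNameSketch`, `domainName` and the four-entry suffix set are hypothetical, and stripping the trailing dot and converting the whole host with `java.net.IDN` are simplifying assumptions.

```java
import java.net.IDN;
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

/** Illustrative only: a toy suffix set stands in for the real public suffix list. */
final class DomainNameSketch {

  // Hypothetical sample; the real list has thousands of entries.
  private static final Set<String> PUBLIC_SUFFIXES = Set.of("org", "com", "xyz", "co.uk");

  /** Returns the registrable domain of a host, or the host itself if no suffix matches. */
  static String domainName(String host) {
    String h = host.toLowerCase(Locale.ROOT);
    if (h.endsWith(".")) {
      h = h.substring(0, h.length() - 1); // "www.apache.org." becomes "www.apache.org"
    }
    h = IDN.toASCII(h); // Unicode labels become their ASCII ("xn--") form
    String[] labels = h.split("\\.");
    // Try the longest candidate suffix first, leaving at least one label in front of it.
    for (int i = 1; i < labels.length; i++) {
      String suffix = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
      if (PUBLIC_SUFFIXES.contains(suffix)) {
        return labels[i - 1] + "." + suffix;
      }
    }
    return h; // no valid public suffix found: fall back to the host name
  }
}
```

   With this toy list, `domainName("www.apache.org.")` yields `apache.org`, while a host such as `localhost` is returned unchanged because no suffix matches.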



Re: [PR] NUTCH-3044 Generator: NPE when extracting the host part of a URL fails [nutch]

2024-04-28 Thread via GitHub


lewismc commented on PR #815:
URL: https://github.com/apache/nutch/pull/815#issuecomment-2081564107

   Excellent @sebastian-nagel +1



Re: [PR] NUTCH-3043 Generator: count URLs rejected by URL filters [nutch]

2024-04-28 Thread via GitHub


lewismc commented on PR #814:
URL: https://github.com/apache/nutch/pull/814#issuecomment-2081563229

   Excellent @sebastian-nagel 



Re: [PR] NUTCH-3044 Generator: NPE when extracting the host part of a URL fails [nutch]

2024-04-27 Thread via GitHub


sebastian-nagel commented on PR #815:
URL: https://github.com/apache/nutch/pull/815#issuecomment-2080743831

   ... also fixed the Javadoc error.



Re: [PR] NUTCH-3043 Generator: count URLs rejected by URL filters [nutch]

2024-04-27 Thread via GitHub


sebastian-nagel commented on PR #814:
URL: https://github.com/apache/nutch/pull/814#issuecomment-2080634329

   Hi @lewismc:
   - "use parameterized logging": done
   - "augment the [metrics 
documentation](https://cwiki.apache.org/confluence/display/NUTCH/Metrics) once 
this is merged.": will do
   - "we could also [create a test for the 
counters](https://cwiki.apache.org/confluence/display/MRUNIT/MRUnit+Tutorial#MRUnitTutorial-TestingCounters).":
 for now, TestGenerator is not based on MRUnit. The various 
Generator::generate(...) methods return the number of generated segments without a way 
to access the counters (they're logged, however). I'd prefer to track this in a 
separate issue, because it would require too many code changes to read the 
counters.



Re: [PR] NUTCH-3044 Generator: NPE when extracting the host part of a URL fails [nutch]

2024-04-27 Thread via GitHub


sebastian-nagel commented on PR #815:
URL: https://github.com/apache/nutch/pull/815#issuecomment-2080603546

   > we could provide a TestGenerator#testNullHostInReducer test case
   
   Good idea! Done, see 4729786.



Re: [PR] NUTCH-3043 Generator: count URLs rejected by URL filters [nutch]

2024-04-25 Thread via GitHub


lewismc commented on code in PR #814:
URL: https://github.com/apache/nutch/pull/814#discussion_r1579883313


##
src/java/org/apache/nutch/crawl/Generator.java:
##
@@ -253,10 +256,7 @@ public void map(Text key, CrawlDatum value, Context context)
   try {
 sort = scfilters.generatorSortValue(key, crawlDatum, sort);
   } catch (ScoringFilterException sfe) {
-if (LOG.isWarnEnabled()) {
-  LOG.warn(
-  "Couldn't filter generatorSortValue for " + key + ": " + sfe);
-}
+LOG.warn("Couldn't filter generatorSortValue for " + key + ": " + sfe);

Review Comment:
   Please use parameterized logging.
   ```
   LOG.warn("Couldn't filter generatorSortValue for {}: {}", key, sfe);
   ```
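
   To illustrate why placeholder-style logging makes an explicit `isWarnEnabled()` guard unnecessary, here is a minimal stand-alone sketch. `MiniLogger` is a hypothetical class written for this example; it only mimics the `{}` substitution idea and is not SLF4J's implementation.

```java
/** Hypothetical mini logger; it mimics the "{}" placeholder idea, not SLF4J itself. */
final class MiniLogger {
  private final boolean warnEnabled;
  private final StringBuilder sink = new StringBuilder();

  MiniLogger(boolean warnEnabled) {
    this.warnEnabled = warnEnabled;
  }

  /** Substitutes each "{}" with the next argument, but only if WARN is enabled. */
  void warn(String pattern, Object... args) {
    if (!warnEnabled) {
      return; // the arguments are never converted to strings
    }
    StringBuilder msg = new StringBuilder();
    int from = 0;
    for (Object arg : args) {
      int at = pattern.indexOf("{}", from);
      if (at < 0) {
        break; // more arguments than placeholders
      }
      msg.append(pattern, from, at).append(arg);
      from = at + 2;
    }
    msg.append(pattern.substring(from));
    sink.append(msg).append('\n');
  }

  /** Everything logged so far, one message per line. */
  String output() {
    return sink.toString();
  }
}
```

   With string concatenation the message is built before the logger is even called; with placeholders the arguments' `toString()` methods run only when the level is enabled.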




Re: [PR] NUTCH-3041 Address confusing logging in o.a.n.net.URLExemptionFilters [nutch]

2024-04-19 Thread via GitHub


lewismc commented on PR #813:
URL: https://github.com/apache/nutch/pull/813#issuecomment-2067543713

   The logging now looks as follows:
   ```
   INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] Found 1 URLExemptionFilter implementations: '[org.apache.nutch.urlfilter.ignoreexempt.ExemptionUrlFilter@3090c372]'
   ```
   If no URLExemptionFilter implementations are found, then no log statement is 
produced. 



[PR] NUTCH-3039 Failure to handle ftp:// URLs [nutch]

2024-04-11 Thread via GitHub


sebastian-nagel opened a new pull request, #812:
URL: https://github.com/apache/nutch/pull/812

   Pass ftp:// URLs to the standard JVM URLStreamHandler
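
   As a rough illustration of the mechanism (not the code in this PR): a `java.net.URLStreamHandlerFactory` can hand a protocol back to the JVM by returning `null`, in which case `java.net.URL` falls back to its built-in handler. The class name `DelegatingHandlerFactory` and the protocol set below are assumptions made for this sketch.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Locale;
import java.util.Set;

/** Hypothetical factory: leaves well-known protocols to the JVM, handles the rest itself. */
final class DelegatingHandlerFactory implements URLStreamHandlerFactory {

  // Assumed set of protocols that should use the JVM's standard handlers.
  private static final Set<String> JVM_PROTOCOLS = Set.of("http", "https", "file", "jar", "ftp");

  @Override
  public URLStreamHandler createURLStreamHandler(String protocol) {
    if (JVM_PROTOCOLS.contains(protocol.toLowerCase(Locale.ROOT))) {
      return null; // null makes java.net.URL fall back to its built-in handler
    }
    // Placeholder handler for any other scheme: the URL parses, but cannot be opened.
    return new URLStreamHandler() {
      @Override
      protected URLConnection openConnection(URL u) throws IOException {
        throw new IOException("No connection support for " + u.getProtocol());
      }
    };
  }
}
```

   Such a factory would be installed once per JVM via `URL.setURLStreamHandlerFactory(...)`.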



Re: [PR] NUTCH-3038 Address issues discovered during 1.20 release management dryrun [nutch]

2024-04-08 Thread via GitHub


lewismc merged PR #811:
URL: https://github.com/apache/nutch/pull/811



[PR] NUTCH-3038 Address issues discovered during 1.20 release management dryrun [nutch]

2024-04-05 Thread via GitHub


lewismc opened a new pull request, #811:
URL: https://github.com/apache/nutch/pull/811

   PR for https://issues.apache.org/jira/browse/NUTCH-3038



Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]

2024-04-04 Thread via GitHub


lewismc merged PR #810:
URL: https://github.com/apache/nutch/pull/810



Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]

2024-03-30 Thread via GitHub


CatChullain commented on PR #810:
URL: https://github.com/apache/nutch/pull/810#issuecomment-2028497765

   Thanks again, @lewismc. 
   
   I did add those INFO messages, but I found an extra call to setIndexedConf 
from setConf that the filter() method handles more cleanly, so I removed that, 
too.




Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]

2024-03-30 Thread via GitHub


lewismc commented on PR #810:
URL: https://github.com/apache/nutch/pull/810#issuecomment-2028343327

   Hi @CatChullain I associated this Jira ticket to the 1.20 release and made 
you assignee  
   We will get it merged soon and roll the release.



Re: [PR] NUTCH-3035 Update license and notice file for release of 1.20 [nutch]

2024-03-30 Thread via GitHub


lewismc merged PR #808:
URL: https://github.com/apache/nutch/pull/808



Re: [PR] NUTCH-3036 Upgrade org.seleniumhq.selenium:selenium-java dependency i… [nutch]

2024-03-30 Thread via GitHub


lewismc merged PR #807:
URL: https://github.com/apache/nutch/pull/807



Re: [PR] NUTCH-3037 Upgrade org.apache.kafka:kafka_2.12: to v3.7.0 [nutch]

2024-03-30 Thread via GitHub


lewismc merged PR #809:
URL: https://github.com/apache/nutch/pull/809



Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]

2024-03-30 Thread via GitHub


lewismc commented on PR #810:
URL: https://github.com/apache/nutch/pull/810#issuecomment-2028304406

   @CatChullain thanks for your patience whilst we work this one  
   
   > … I wonder where might be good spots for INFO level messages
   
   The reason I suggested that the log level be revised from `INFO` to `DEBUG` 
was that any logging needs to make sense in the context of the entire log. Said 
another way, plugin logging needs to complement the core crawler tasks.
   That being said, if you want to include `INFO` for the following scenarios 
then please go ahead. 
   * recording the count value, and
   * indicating when overwrite is true
   Your rationale is sound. 
   
   



Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]

2024-03-30 Thread via GitHub


CatChullain commented on PR #810:
URL: https://github.com/apache/nutch/pull/810#issuecomment-2028122038

   Thanks, Lewis! I moved all four to DEBUG, but I wonder where might be good 
spots for INFO level messages. I'm thinking of the operator or tech who doesn't 
dig into code and has an issue in the config. During dev & test myself, I 
sometimes forgot to increment the index.arbitrary.function.count and the plugin 
ignored the later fields. Just outputting that count value, and maybe something 
when overwrite is true, might be helpful for alerting someone that the config 
might not be what they'd believed.
   
   Do either of those (or something else) seem worthwhile, or does it make more 
sense to let people use it and see what issues they raise?
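
   The misconfiguration described above (stanzas declared beyond `index.arbitrary.function.count`) could be surfaced roughly as follows. This is only a sketch: `java.util.Properties` stands in for Hadoop's `Configuration`, and `IndexedConfigSketch`, `configuredFields` and `ignoredFields` are hypothetical names, not part of the plugin.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Sketch only: java.util.Properties stands in for Hadoop's Configuration. */
final class IndexedConfigSketch {
  static final String COUNT_KEY = "index.arbitrary.function.count";
  static final String FIELD_PREFIX = "index.arbitrary.fieldName.";

  private static int count(Properties conf) {
    return Integer.parseInt(conf.getProperty(COUNT_KEY, "0"));
  }

  /** Field names of stanzas 0 .. count-1, i.e. the ones that would be read. */
  static List<String> configuredFields(Properties conf) {
    List<String> fields = new ArrayList<>();
    for (int i = 0; i < count(conf); i++) {
      String name = conf.getProperty(FIELD_PREFIX + i);
      if (name != null) {
        fields.add(name);
      }
    }
    return fields;
  }

  /** Field names declared at index >= count, i.e. stanzas that would be skipped. */
  static List<String> ignoredFields(Properties conf) {
    List<String> ignored = new ArrayList<>();
    for (int i = count(conf); conf.getProperty(FIELD_PREFIX + i) != null; i++) {
      ignored.add(conf.getProperty(FIELD_PREFIX + i));
    }
    return ignored;
  }
}
```

   An INFO message could then report the count and, if `ignoredFields` is non-empty, name the stanzas that will not be used.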



Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]

2024-03-29 Thread via GitHub


lewismc commented on code in PR #810:
URL: https://github.com/apache/nutch/pull/810#discussion_r1544806230


##
src/plugin/index-arbitrary/src/java/org/apache/nutch/indexer/arbitrary/ArbitraryIndexingFilter.java:
##
@@ -0,0 +1,284 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.indexer.arbitrary;
+
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.indexer.IndexingFilter;
+import org.apache.nutch.indexer.IndexingException;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.crawl.Inlinks;
+
+import org.apache.hadoop.io.Text;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.lang.invoke.MethodHandles;
+import java.lang.Class;
+import java.lang.reflect.Constructor;
+import java.lang.reflect.Method;
+
+import org.apache.hadoop.conf.Configuration;
+
+/**
+ * Adds arbitrary searchable fields to a document from the class and method
+ * the user identifies in the config. The user supplies the name of the field
+ * to add with the class and method names that supply the value.
+ * 
+ * Example:
+ * &lt;property&gt;
+ *   &lt;name&gt;index.arbitrary.function.count&lt;/name&gt;
+ *   &lt;value&gt;1&lt;/value&gt;
+ * &lt;/property&gt;
+ * 
+ * &lt;property&gt;
+ *   &lt;name&gt;index.arbitrary.fieldName.0&lt;/name&gt;
+ *   &lt;value&gt;advisors&lt;/value&gt;
+ * &lt;/property&gt;
+ * 
+ * &lt;property&gt;
+ *   &lt;name&gt;index.arbitrary.className.0&lt;/name&gt;
+ *   &lt;value&gt;com.example.arbitrary.AdvisorCalculator&lt;/value&gt;
+ * &lt;/property&gt;
+ * 
+ * &lt;property&gt;
+ *   &lt;name&gt;index.arbitrary.constructorArgs.0&lt;/name&gt;
+ *   &lt;value&gt;Kirk&lt;/value&gt;
+ * &lt;/property&gt;
+ * 
+ * &lt;property&gt;
+ *   &lt;name&gt;index.arbitrary.methodName.0&lt;/name&gt;
+ *   &lt;value&gt;countAdvisors&lt;/value&gt;
+ * &lt;/property&gt;
+ * 
+ * &lt;property&gt;
+ *   &lt;name&gt;index.arbitrary.methodArgs.0&lt;/name&gt;
+ *   &lt;value&gt;Spock,McCoy&lt;/value&gt;
+ * &lt;/property&gt;
+ * 
+ * To set more than one arbitrary field value,
+ * increment {@code index.arbitrary.function.count} and
+ * repeat the rest of these blocks with successive int values
+ * appended to the property names, e.g. fieldName.1, methodName.1, etc.
+ */
+public class ArbitraryIndexingFilter implements IndexingFilter {
+
+  private static final Logger LOG = LoggerFactory
+.getLogger(MethodHandles.lookup().lookupClass());
+
+  /** How many arbitrary field definitions to set. */
+  private int arbitraryAddsCount = 0;
+  
+  /** The name of the field to insert/overwrite in the NutchDocument */
+  private String fieldName;
+  
+  /** The fully-qualified class name of the custom class to use for the
+   *  new field. This class must be in the Nutch runtime classpath,
+   *  e.g., nutch/lib/ directory. */
+  private String className;
+  
+  /** The String values to pass to the custom class constructor. The plugin
+   *  will add the document url as the first argument in className's
+   *  String[] args. */
+  private String[] userConstrArgs;
+  
+  /** The array where the plugin copies the url &amp; the userConstrArgs
+   *  to create the instance of className. */
+  private String[] constrArgs;
+  
+  /** The name of the method in the custom class to call. Its return value
+   *  will become the value of fieldName in the NutchDocument. */
+  private String methodName;
+  
+  /** The String values of the arguments to methodName. It's up to the
+   *  developer of className to do any casts/conversions from String to
+   *  another class in the code of className. */
+  private String[] methodArgs;
+  
+  /** The result that returns from methodName. The plugin will set the value
+   *  of fieldName to this. */
+  private Object result;
+  
+  /** Optional flag to determine whether to overwrite the existing value in the
+   *  NutchDocument fieldName if this is set to true. Default behavior is to
+   *  add the value from calling methodName to existing values for fieldName. */
+  private boolean overwrite = false;
+  
+  /** Hadoop Configuration object to pass these values into the plugin. */
+  private Configuration conf;
+
+  /**
+   * The {@link ArbitraryIndexingFilter} filter object uses reflection
+   * to instantiate the configured class and invoke the configured method.
+   * It requires a few configuration settings for adding arbitrary fields
+   * and values to the NutchDocument as searchable fields.

Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]

2024-03-26 Thread via GitHub


CatChullain commented on PR #810:
URL: https://github.com/apache/nutch/pull/810#issuecomment-2021774505

   Thanks, Lewis! I got some of it done today. I'll consolidate the LOG 
statements a bit more tomorrow.



Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]

2024-03-26 Thread via GitHub


lewismc commented on code in PR #810:
URL: https://github.com/apache/nutch/pull/810#discussion_r1539452666


##
src/plugin/index-arbitrary/src/java/org/apache/nutch/indexer/arbitrary/ArbitraryIndexingFilter.java:
##
@@ -0,0 +1,266 @@
+package org.apache.nutch.indexer.arbitrary;
+
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.indexer.IndexingFilter;
+import org.apache.nutch.indexer.IndexingException;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.crawl.Inlinks;
+
+import org.apache.hadoop.io.Text;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.lang.invoke.MethodHandles;
+import java.lang.Class;
+import java.lang.reflect.Constructor;
+import java.lang.reflect.Method;
+
+import org.apache.hadoop.conf.Configuration;
+
+/**
+ * Adds arbitrary searchable fields to a document from the class and method
+ * the user identifies in the config. The user supplies the name of the field
+ * to add with the class and method names that supply the value.
+ * 
+ * Example:
+ * &lt;property&gt;
+ *   &lt;name&gt;index.arbitrary.function.count&lt;/name&gt;
+ *   &lt;value&gt;1&lt;/value&gt;
+ * &lt;/property&gt;
+ * 
+ * &lt;property&gt;
+ *   &lt;name&gt;index.arbitrary.fieldName.0&lt;/name&gt;
+ *   &lt;value&gt;advisors&lt;/value&gt;
+ * &lt;/property&gt;
+ * 
+ * &lt;property&gt;
+ *   &lt;name&gt;index.arbitrary.className.0&lt;/name&gt;
+ *   &lt;value&gt;com.example.arbitrary.AdvisorCalculator&lt;/value&gt;
+ * &lt;/property&gt;
+ * 
+ * &lt;property&gt;
+ *   &lt;name&gt;index.arbitrary.constructorArgs.0&lt;/name&gt;
+ *   &lt;value&gt;Kirk&lt;/value&gt;
+ * &lt;/property&gt;
+ * 
+ * &lt;property&gt;
+ *   &lt;name&gt;index.arbitrary.methodName.0&lt;/name&gt;
+ *   &lt;value&gt;countAdvisors&lt;/value&gt;
+ * &lt;/property&gt;
+ * 
+ * &lt;property&gt;
+ *   &lt;name&gt;index.arbitrary.methodArgs.0&lt;/name&gt;
+ *   &lt;value&gt;Spock,McCoy&lt;/value&gt;
+ * &lt;/property&gt;
+ * 
+ * To set more than one arbitrary field value,
+ * increment {@code index.arbitrary.function.count} and
+ * repeat the rest of these blocks with successive int values
+ * appended to the property names, e.g. fieldName.1, methodName.1, etc.
+ */
+public class ArbitraryIndexingFilter implements IndexingFilter {
+
+  private static final Logger LOG = LoggerFactory
+.getLogger(MethodHandles.lookup().lookupClass());
+
+  /** How many arbitrary field definitions to set. */
+  private int arbitraryAddsCount = 0;
+  
+  /** The name of the field to insert/overwrite in the NutchDocument */
+  private String fieldName;
+  
+  /** The fully-qualified class name of the custom class to use for the
+   *  new field. This class must be in the Nutch runtime classpath,
+   *  e.g., nutch/lib/ directory. */
+  private String className;
+  
+  /** The String values to pass to the custom class constructor. The plugin
+   *  will add the document url as the first argument in className's
+   *  String[] args. */
+  private String[] userConstrArgs;
+  
+  /** The array where the plugin copies the url &amp; the userConstrArgs
+   *  to create the instance of className. */
+  private String[] constrArgs;
+  
+  /** The name of the method in the custom class to call. Its return value
+   *  will become the value of fieldName in the NutchDocument. */
+  private String methodName;
+  
+  /** The String values of the arguments to methodName. It's up to the
+   *  developer of className to do any casts/conversions from String to
+   *  another class in the code of className. */
+  private String[] methodArgs;
+  
+  /** The result that returns from methodName. The plugin will set the value
+   *  of fieldName to this. */
+  private Object result;
+  
+  /** Optional flag to determine whether to overwrite the existing value in the
+   *  NutchDocument fieldName if this is set to true. Default behavior is to
+   *  add the value from calling methodName to existing values for fieldName. */
+  private boolean overwrite = false;
+  
+  /** Hadoop Configuration object to pass these values into the plugin. */
+  private Configuration conf;
+
+  /**
+   * The {@link ArbitraryIndexingFilter} filter object uses reflection
+   * to instantiate the configured class and invoke the configured method.
+   * It requires a few configuration settings for adding arbitrary fields
+   * and values to the NutchDocument as searchable fields.
+   * See {@code index.arbitrary.function.count}, and (possibly multiple
+   * instances when {@code index.arbitrary.function.count} > 1) of the following
+   * {@code index.arbitrary.fieldName}.index,
+   * {@code index.arbitrary.className}.index,
+   * {@code index.arbitrary.constructorArgs}.index,
+   * {@code index.arbitrary.methodName}.index, and
+   * {@code index.arbitrary.methodArgs}.index
+   * in nutch-default.xml or nutch-site.xml where index ranges from 0
+   * to {@code index.arbitrary.function.count} - 1.
+   * 
+   * @param doc
+   *  The {@link NutchDocument} object
+   * @param parse
+   *  The relevant {@link Parse} object passing through the filter
+   * @param url
+   *  URL to be filtered by the user-specified class
+   * @param datum
+   *  The {@link 

Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]

2024-03-26 Thread via GitHub


lewismc commented on code in PR #810:
URL: https://github.com/apache/nutch/pull/810#discussion_r1539390873


##
src/plugin/index-arbitrary/ivy.xml:
##
@@ -0,0 +1,41 @@
+
+
+
+

Review Comment:
   Please remove whitespace. 



##
src/plugin/index-arbitrary/src/test/org/apache/nutch/indexer/arbitrary/Multiplier.java:
##
@@ -0,0 +1,31 @@
+package org.apache.nutch.indexer.arbitrary;

Review Comment:
   Please add ALv2 license header.



##
build.xml:
##
@@ -44,7 +44,7 @@
   
   
 
-  
+  

Review Comment:
   Please sync with `master` branch. This regression is tangential to 
NUTCH-3032. Thanks



##
src/plugin/index-arbitrary/src/test/org/apache/nutch/indexer/arbitrary/Echo.java:
##
@@ -0,0 +1,24 @@
+package org.apache.nutch.indexer.arbitrary;

Review Comment:
   Please add ALv2 license header.



##
src/plugin/index-arbitrary/ivy.xml:
##
@@ -0,0 +1,41 @@
+
+

Review Comment:
   Please remove whitespace. 



##
src/plugin/index-arbitrary/build.xml:
##
@@ -0,0 +1,6 @@
+

Review Comment:
   Please add ALv2 license header.



##
src/plugin/index-arbitrary/src/java/org/apache/nutch/indexer/arbitrary/ArbitraryIndexingFilter.java:
##
@@ -0,0 +1,266 @@
+package org.apache.nutch.indexer.arbitrary;

Review Comment:
   Please add ALv2 license header.






[PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]

2024-03-25 Thread via GitHub


CatChullain opened a new pull request, #810:
URL: https://github.com/apache/nutch/pull/810

   This is the initial code for an arbitrary indexing filter, NUTCH-3032.
   
   It could be helpful to let end users manipulate information at indexing time 
with their own code without the need for writing their own indexing plugin. I 
mentioned this on the dev mailing list 
(https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some 
description of my work in progress.
   
   One potential use is to address some of the same concerns that NUTCH-585 
discusses regarding an alternative approach to picking and choosing which 
content to index, but this approach would allow making index time decisions, 
rather than setting the configuration for all content at the start of the 
indexing run.
   
   Ideally a solution to NUTCH-585 would work at parse time, but this index 
time code still has potential uses for any kind of look-ups or calculations 
that depend on values in the document where the need to manipulate data exceeds 
what something like Jexl filter can do easily, or where outside data is worth 
incorporating into the document for use after indexing.
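
To make the idea concrete, here is a minimal sketch of what a user class and the reflective call might look like. It is not taken from the PR: `AdvisorCounter` and `ArbitraryFieldSketch` are hypothetical names, and the sketch assumes that the constructor receives a single `String[]` whose first element is the document URL and that the configured method also takes a `String[]`, as the plugin's Javadoc describes.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;

/** Hypothetical user class; stands in for the configured className. */
class AdvisorCounter {
  private final String captain;

  // Assumption: args[0] is the document URL added by the plugin,
  // followed by the configured constructorArgs. The URL is ignored here.
  public AdvisorCounter(String[] args) {
    this.captain = args.length > 1 ? args[1] : "";
  }

  // Assumption: the configured methodArgs arrive as a String[].
  public String countAdvisors(String[] advisors) {
    return captain + " has " + advisors.length + " advisors";
  }
}

/** Hypothetical stand-in for the reflective call made at index time. */
class ArbitraryFieldSketch {

  static Object invoke(String className, String[] constrArgs,
      String methodName, String[] methodArgs) {
    try {
      Class<?> clazz = Class.forName(className);
      Constructor<?> ctor = clazz.getDeclaredConstructor(String[].class);
      // Cast to Object so the array is passed as one argument, not as varargs.
      Object instance = ctor.newInstance((Object) constrArgs);
      Method method = clazz.getDeclaredMethod(methodName, String[].class);
      return method.invoke(instance, (Object) methodArgs);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Could not call " + methodName, e);
    }
  }

  public static void main(String[] args) {
    Object value = invoke(AdvisorCounter.class.getName(),
        new String[] { "http://example.org/", "Kirk" },
        "countAdvisors",
        new String[] { "Spock", "McCoy" });
    // The returned object would become the value of the configured field.
    System.out.println(value);
  }
}
```

With the values from the Javadoc example (`Kirk`, `countAdvisors`, `Spock,McCoy`), the returned string would become the value of the configured field. Any conversion from `String` to other types is left to the user class.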
   
   Thanks for your contribution to [Apache Nutch](https://nutch.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Nutch issue 
tracker](https://issues.apache.org/jira/projects/NUTCH) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`NUTCH-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[NUTCH-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Java source code follows [Nutch Eclipse Code Formatting 
rules](https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml)
   * Nutch is successfully built and unit tests pass by running `ant clean 
runtime test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* master branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled master branch.
   * if new dependencies are added,
 - are these dependencies licensed in a way that is compatible for 
inclusion under [ASF 
2.0](https://www.apache.org/legal/resolved.html#category-a)?
 - are `LICENSE-binary` and `NOTICE-binary` updated accordingly?
   
   We will be able to faster integrate your pull request if these conditions 
are met. If you have any questions how to fix your problem or about using Nutch 
in general, please sign up for the [Nutch mailing 
list](https://nutch.apache.org/mailing_lists.html). Thanks!
   





Re: [PR] NUTCH-3035 Update license and notice file for release of 1.20 [nutch]

2024-03-15 Thread via GitHub


sebastian-nagel commented on PR #808:
URL: https://github.com/apache/nutch/pull/808#issuecomment-2000233258

   Hi Lewis, it's done in three steps:
   1. run `ant report-licenses` (Rat task) for core and all plugins
   2. process all reports: list all combinations of , try to 
extract the organization and project description from the ivy cache, normalize 
license names, etc.
   3. manually verify the output of steps 1 and 2 and merge it with the existing 
license and notice files
   
   Steps 1 and 2 are done by the Jupyter notebook attached to NUTCH-2290. 
Because the output (NOTICE-binary and LICENSE-binary) is somewhat noisy, manual 
verification is necessary.
   
   See also NUTCH-2981: Storm has some scripts to automatically generate the 
license reports.





Re: [PR] NUTCH-3026 -- add statusOnly as an indexing option [nutch]

2024-03-15 Thread via GitHub


tballison closed pull request #799: NUTCH-3026 -- add statusOnly as an indexing 
option
URL: https://github.com/apache/nutch/pull/799





Re: [PR] WIP StatsD metrics example [nutch]

2024-03-14 Thread via GitHub


lewismc commented on PR #712:
URL: https://github.com/apache/nutch/pull/712#issuecomment-1998875276

   Closing this PR out. StatsD is widely used but open source Java SDKs/agents 
are few and far between.
   When I get around to properly instrumenting Nutch I will probably suggest 
that we use [Apache SkyWalking](https://skywalking.apache.org/).





Re: [PR] WIP StatsD metrics example [nutch]

2024-03-14 Thread via GitHub


lewismc closed pull request #712: WIP StatsD metrics example
URL: https://github.com/apache/nutch/pull/712





Re: [PR] NUTCH-3036 Upgrade org.seleniumhq.selenium:selenium-java dependency i… [nutch]

2024-03-14 Thread via GitHub


lewismc closed pull request #807: NUTCH-3036 Upgrade 
org.seleniumhq.selenium:selenium-java dependency i…
URL: https://github.com/apache/nutch/pull/807





Re: [PR] NUTCH-3036 Upgrade org.seleniumhq.selenium:selenium-java dependency i… [nutch]

2024-03-14 Thread via GitHub


lewismc commented on PR #807:
URL: https://github.com/apache/nutch/pull/807#issuecomment-1998718730

   There are some tangential proposed changes (such as improvements to logging) 
to this PR but they concern the relevant Class files. 





Re: [PR] NUTCH-3035 Update license and notice file for release of 1.20 [nutch]

2024-03-14 Thread via GitHub


lewismc commented on PR #808:
URL: https://github.com/apache/nutch/pull/808#issuecomment-1998717443

   Hi @sebastian-nagel did you perform this task manually?





Re: [PR] NUTCH-3035 Update license and notice file for release of 1.20 [nutch]

2024-03-14 Thread via GitHub


lewismc closed pull request #808: NUTCH-3035 Update license and notice file for 
release of 1.20
URL: https://github.com/apache/nutch/pull/808





Re: [PR] NUTCH-3036 Upgrade org.seleniumhq.selenium:selenium-java dependency i… [nutch]

2024-03-14 Thread via GitHub


lewismc commented on PR #807:
URL: https://github.com/apache/nutch/pull/807#issuecomment-1998714969

   [Further guidance on browser compatibility/supported 
platforms](https://firefox-source-docs.mozilla.org/testing/geckodriver/Support.html)
   
   Along the way I discovered that **_full screenshots_** are now handled 
differently so we need to rethink how to do this. For example, the 
[FirefoxDriver has a pretty elegant way of doing 
this](https://www.selenium.dev/selenium/docs/api/java/org/openqa/selenium/firefox/FirefoxDriver.html#getFullPageScreenshotAs(org.openqa.selenium.OutputType))
 but it is different on other browsers.
   For the time being each browser can take a screenshot of the view 
window/partial webpage. This is satisfactory but there is room for improvement.





Re: [PR] NUTCH-3036 Upgrade org.seleniumhq.selenium:selenium-java dependency i… [nutch]

2024-03-14 Thread via GitHub


lewismc commented on PR #807:
URL: https://github.com/apache/nutch/pull/807#issuecomment-1998711992

   PR ready for review. Tested on
   * MacBook Pro
   * Apple M1 Pro
   * macOS Sonoma 14.4
   * Firefox 115.X (compatible with current version of Selenium)





[PR] NUTCH-3035 Update license and notice file for release of 1.20 [nutch]

2024-03-14 Thread via GitHub


sebastian-nagel opened a new pull request, #808:
URL: https://github.com/apache/nutch/pull/808

   Update the license and notice files of dependencies included as binary jar 
files in the binary release.





Re: [PR] NUTCH-3008 indexer-elastic: downgrade to ES 7.10.2 to address licensing issues [nutch]

2024-03-14 Thread via GitHub


sebastian-nagel merged PR #806:
URL: https://github.com/apache/nutch/pull/806





Re: [PR] NUTCH-3008 indexer-elastic: downgrade to ES 7.10.2 to address licensing issues [nutch]

2024-03-13 Thread via GitHub


lewismc commented on PR #806:
URL: https://github.com/apache/nutch/pull/806#issuecomment-1995922015

   Tested with ES 7.10.2 6 node cluster. +1 LGTM.





[PR] NUTCH-3008 indexer-elastic: downgrade to ES 7.10.2 to address licensing issues [nutch]

2024-03-13 Thread via GitHub


sebastian-nagel opened a new pull request, #806:
URL: https://github.com/apache/nutch/pull/806

   This PR downgrades the ES client to version 7.10.2 which is licensed under 
ASF 2.0 - it's a quick fix to stay compatible with ASF policies.
   
   Not yet tested: indexing into ES
   
   To be done: update the LICENSE and NOTICE files. I'll do this as part of a 
separate issue NUTCH-3035.





Re: [PR] NUTCH-3033 Upgrade Ivy to v2.5.2 [nutch]

2024-03-13 Thread via GitHub


lewismc commented on PR #803:
URL: https://github.com/apache/nutch/pull/803#issuecomment-1994562354

   Thanks @sebastian-nagel  





Re: [PR] NUTCH-3033 Upgrade Ivy to v2.5.2 [nutch]

2024-03-13 Thread via GitHub


lewismc merged PR #803:
URL: https://github.com/apache/nutch/pull/803





Re: [PR] Update Dockerfile / JAVA_HOME - 2nd try [nutch]

2024-03-12 Thread via GitHub


lewismc commented on PR #805:
URL: https://github.com/apache/nutch/pull/805#issuecomment-1993567784

   Thanks @derhecht  





Re: [PR] Update Dockerfile / JAVA_HOME - 2nd try [nutch]

2024-03-12 Thread via GitHub


lewismc merged PR #805:
URL: https://github.com/apache/nutch/pull/805





Re: [PR] NUTCH-3033 Upgrade Ivy to v2.5.2 [nutch]

2024-03-12 Thread via GitHub


lewismc commented on PR #803:
URL: https://github.com/apache/nutch/pull/803#issuecomment-1993545146

   After lots of trial and error I think I cracked this one. Ultimately there 
were several places where the optional `(-[classifier])` element has to be 
added to the `ivy:retrieve pattern`. 
   This wasn’t particularly intuitive as the ivy documentation is [somewhat 
lacking in this 
regard](https://ant.apache.org/ivy/history/2.5.2/resolver/filesystem.html#_child_elements)
 however @bodewig [pointed me in the right direction on the ivy-user@ mailing 
list](https://lists.apache.org/thread/fdd9r5gkdk5215hc9swcxhjwyvnzoz0w). Thank 
you for that @bodewig.
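
As an illustration of why the optional `(-[classifier])` part matters, the sketch below expands an Ivy-style retrieve pattern in plain Java. It is not Ivy's implementation: `RetrievePatternSketch` is a hypothetical helper, the pattern strings are assumed rather than copied from the Nutch build files, and the classifier values are only examples. The rule it mimics is that a parenthesized part is dropped when the token inside it has no value.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that mimics Ivy-style pattern expansion. */
class RetrievePatternSketch {
  private static final Pattern OPTIONAL = Pattern.compile("\\(([^()]*)\\)");
  private static final Pattern TOKEN = Pattern.compile("\\[([a-z]+)\\]");

  static String expand(String pattern, Map<String, String> attrs) {
    // Step 1: keep an optional "( ... )" part only if all its tokens have values.
    Matcher opt = OPTIONAL.matcher(pattern);
    StringBuilder resolved = new StringBuilder();
    while (opt.find()) {
      String inner = opt.group(1);
      boolean allSet = true;
      Matcher t = TOKEN.matcher(inner);
      while (t.find()) {
        String v = attrs.get(t.group(1));
        if (v == null || v.isEmpty()) {
          allSet = false;
        }
      }
      opt.appendReplacement(resolved,
          Matcher.quoteReplacement(allSet ? inner : ""));
    }
    opt.appendTail(resolved);

    // Step 2: substitute the remaining [token] placeholders.
    Matcher tok = TOKEN.matcher(resolved.toString());
    StringBuilder out = new StringBuilder();
    while (tok.find()) {
      String v = attrs.getOrDefault(tok.group(1), "");
      tok.appendReplacement(out, Matcher.quoteReplacement(v));
    }
    tok.appendTail(out);
    return out.toString();
  }

  public static void main(String[] args) {
    String without = "[artifact]-[revision].[ext]";
    String with = "[artifact]-[revision](-[classifier]).[ext]";
    // Two artifacts of the same module that differ only by classifier
    // (classifier values are illustrative).
    Map<String, String> a = Map.of("artifact", "netty-transport-native-kqueue",
        "revision", "4.1.84.Final", "classifier", "osx-x86_64", "ext", "jar");
    Map<String, String> b = Map.of("artifact", "netty-transport-native-kqueue",
        "revision", "4.1.84.Final", "classifier", "osx-aarch_64", "ext", "jar");
    // Without the classifier both artifacts map to the same file name.
    System.out.println(expand(without, a).equals(expand(without, b)));
    // With the optional classifier the file names differ.
    System.out.println(expand(with, a));
    System.out.println(expand(with, b));
  }
}
```

Under these assumptions, the pattern without a classifier maps both artifacts to the same file name, which matches the "retrieved to the same file" error reported for `netty-transport-native-kqueue`, while the pattern with `(-[classifier])` keeps them apart and leaves artifacts without a classifier unchanged.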
   
   This PR is ready for review.





Re: [PR] NUTCH-3033 Upgrade Ivy to v2.5.2 [nutch]

2024-03-12 Thread via GitHub


lewismc closed pull request #803: NUTCH-3033 Upgrade Ivy to v2.5.2
URL: https://github.com/apache/nutch/pull/803





Re: [PR] Update Dockerfile / JAVA_HOME [nutch]

2024-03-12 Thread via GitHub


derhecht commented on PR #801:
URL: https://github.com/apache/nutch/pull/801#issuecomment-1991968018

   see #805 





[PR] Update Dockerfile / JAVA_HOME - 2nd try [nutch]

2024-03-12 Thread via GitHub


derhecht opened a new pull request, #805:
URL: https://github.com/apache/nutch/pull/805

   Alpine uses the ash shell by default, which results in the JAVA_HOME 
environment variable not being set.
   
   Sorry, there is no issue reported on issues.apache.org at the moment; 
nevertheless, it is one I'm facing.
   
   see #801 





Re: [PR] NUTCH-3033 Upgrade Ivy to v2.5.2 [nutch]

2024-03-11 Thread via GitHub


lewismc commented on PR #803:
URL: https://github.com/apache/nutch/pull/803#issuecomment-1989446036

   Hmmm, I upgraded to 2.5.1 and the CI runs just fine. Looks like there is 
some regression/additional configuration required with 2.5.2. I’m asking the 
question over on ivy-user@ mailing list. 





Re: [PR] NUTCH-3026 -- add statusOnly as an indexing option [nutch]

2024-03-11 Thread via GitHub


lewismc commented on PR #799:
URL: https://github.com/apache/nutch/pull/799#issuecomment-1989404991

   Hmmm. It appears that there are problems with the `protocol-http` unit tests…
   ```
   [echo] Testing plugin: protocol-http
   [junit] Running org.apache.nutch.protocol.http.TestBadServerResponses
   [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
4.846 sec
   [junit] Running org.apache.nutch.parse.tika.TestHtmlParser
   [junit] Tests run: 9, Failures: 4, Errors: 0, Skipped: 0, Time elapsed: 
3.659 sec
   [junit] Test org.apache.nutch.protocol.http.TestBadServerResponses FAILED
   [junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
2.599 sec
   [junit] Running org.apache.nutch.protocol.http.TestProtocolHttp
   [junit] Running org.apache.nutch.parse.tika.TestImageMetadata
   [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
2.074 sec
   [junit] Running org.apache.nutch.protocol.http.TestResponse
   [junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 
1.699 sec
   ```





Re: [PR] NUTCH-3026 -- add statusOnly as an indexing option [nutch]

2024-03-11 Thread via GitHub


lewismc closed pull request #799: NUTCH-3026 -- add statusOnly as an indexing 
option
URL: https://github.com/apache/nutch/pull/799





Re: [PR] Update Dockerfile / JAVA_HOME [nutch]

2024-03-11 Thread via GitHub


lewismc commented on PR #801:
URL: https://github.com/apache/nutch/pull/801#issuecomment-1989379558

   @derhecht apologies I merged this mistakenly. 
   Can you please submit the PR against master branch?
   Thank you





Re: [PR] NUTCH-3026 -- add statusOnly as an indexing option [nutch]

2024-03-11 Thread via GitHub


lewismc commented on PR #799:
URL: https://github.com/apache/nutch/pull/799#issuecomment-1989380993

   Reopening to have CI run again. 





Re: [PR] Revert "Update Dockerfile / JAVA_HOME" [nutch]

2024-03-11 Thread via GitHub


lewismc merged PR #804:
URL: https://github.com/apache/nutch/pull/804





[PR] Revert "Update Dockerfile / JAVA_HOME" [nutch]

2024-03-11 Thread via GitHub


lewismc opened a new pull request, #804:
URL: https://github.com/apache/nutch/pull/804

   Reverts apache/nutch#801





Re: [PR] Update Dockerfile / JAVA_HOME [nutch]

2024-03-11 Thread via GitHub


lewismc merged PR #801:
URL: https://github.com/apache/nutch/pull/801


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] NUTCH-3033 Upgrade Ivy to v2.5.2 [nutch]

2024-03-11 Thread via GitHub


lewismc commented on PR #803:
URL: https://github.com/apache/nutch/pull/803#issuecomment-1989365064

   OK so it looks like the [newer Ivy version is being used just 
fine](https://github.com/apache/nutch/actions/runs/8239165168/job/22531780061?pr=803#step:4:78).
 The build did however fail with the following
   ```
   Caused by: java.lang.RuntimeException: Multiple artifacts of the module 
io.netty#netty-transport-native-kqueue;4.1.84.Final are retrieved to the same 
file! Update the retrieve pattern to fix this error.
   ```
   … investigating.





[PR] NUTCH-3033 Upgrade Ivy to v2.5.2 [nutch]

2024-03-11 Thread via GitHub


lewismc opened a new pull request, #803:
URL: https://github.com/apache/nutch/pull/803

   PR for https://issues.apache.org/jira/browse/NUTCH-3033
   I was having trouble locally resolving the Ivy version to 2.5.2… I can’t yet 
figure out why 2.5.1 was being used. I’ll check out the CI log and see if the 
newer version is used. 





Re: [PR] [NUTCH-2834] Update crawl documentation / Fix #557 [nutch]

2024-03-10 Thread via GitHub


sebastian-nagel commented on PR #800:
URL: https://github.com/apache/nutch/pull/800#issuecomment-1987171023

   Thanks, @derhecht! Good catch!





Re: [PR] [NUTCH-2834] Update crawl documentation / Fix #557 [nutch]

2024-03-10 Thread via GitHub


sebastian-nagel merged PR #800:
URL: https://github.com/apache/nutch/pull/800





Re: [PR] fix for NUTCH-3027 contributed by skehrli [nutch]

2024-03-10 Thread via GitHub


sebastian-nagel closed pull request #802: fix for NUTCH-3027 contributed by 
skehrli
URL: https://github.com/apache/nutch/pull/802





Re: [PR] fix for NUTCH-3027 contributed by skehrli [nutch]

2024-03-10 Thread via GitHub


sebastian-nagel commented on PR #802:
URL: https://github.com/apache/nutch/pull/802#issuecomment-1987165751

   Patch applied to master in d95e1a7, see comments on Jira in NUTCH-3027. 
Thanks again @skehrli !





[PR] fix for NUTCH-3027 contributed by skehrli [nutch]

2024-01-18 Thread via GitHub


skehrli opened a new pull request, #802:
URL: https://github.com/apache/nutch/pull/802

   Thanks for your contribution to [Apache Nutch](https://nutch.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Nutch issue 
tracker](https://issues.apache.org/jira/projects/NUTCH) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`NUTCH-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[NUTCH-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Java source code follows [Nutch Eclipse Code Formatting 
rules](https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml)
   * Nutch is successfully built and unit tests pass by running `ant clean 
runtime test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* master branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled master branch.
   * if new dependencies are added,
 - are these dependencies licensed in a way that is compatible for 
inclusion under [ASF 
2.0](https://www.apache.org/legal/resolved.html#category-a)?
 - are `LICENSE-binary` and `NOTICE-binary` updated accordingly?
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using 
Nutch in general, please sign up for the [Nutch mailing 
list](https://nutch.apache.org/mailing_lists.html). Thanks!





Re: [PR] NUTCH-1541 Indexer plugin to write CSV [nutch]

2024-01-04 Thread via GitHub


lewismc commented on PR #294:
URL: https://github.com/apache/nutch/pull/294#issuecomment-1877232892

   Hi @grege117 I’ll try to have a crack at this _soon_. Thanks for the heads 
up. If you feel like forking the branch and having a go at the fix, then please 
do. I will try to shepherd in your contribution if you can make one!





Re: [PR] NUTCH-1541 Indexer plugin to write CSV [nutch]

2023-12-22 Thread via GitHub


grege117 commented on PR #294:
URL: https://github.com/apache/nutch/pull/294#issuecomment-1868000580

   Sorry to chime in a few years late, but I'm not sure this plugin is 
configured correctly.
   
   If I modify my conf/index-writers.xml and remove everything except for 
", you will get the message:
   
   IndexerOutputFormat [pool-5-thread-1] No IndexWriters activated - check your 
configuration
   
   The only way I could write to CSV was to execute what @sebastian-nagel  
wrote above:
   bin/nutch index -Dplugin.includes='indexer-csv' crawl/crawldb/ -linkdb 
crawl/linkdb/ crawl/segments/20231222132024/ -filter -normalize -deleteGone
   
   
   However, if I add the index writer for Solr back in, that just works (no 
-Dplugin.includes is required).  
   
   So I think there's a bug in the OOTB configuration that prevents 
indexer-csv from working unless it is specified on the CLI.





[PR] [NUTCH-2834] Update crawl documentation / Fix #557 [nutch]

2023-12-14 Thread via GitHub


derhecht opened a new pull request, #800:
URL: https://github.com/apache/nutch/pull/800

   Show --dedup-group instead of -dedup-group, which has led to 
misleading output
   





Re: [PR] NUTCH-3024 Remove flaky 'dependency check' target [nutch]

2023-11-24 Thread via GitHub


lewismc merged PR #795:
URL: https://github.com/apache/nutch/pull/795





[PR] NUTCH-3026 -- add statusOnly as an indexing option [nutch]

2023-11-17 Thread via GitHub


tballison opened a new pull request, #799:
URL: https://github.com/apache/nutch/pull/799






[PR] fix for NUTCH-2812 contributed by GabeHaegele [nutch]

2023-11-08 Thread via GitHub


GabeHaegele opened a new pull request, #798:
URL: https://github.com/apache/nutch/pull/798






Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]

2023-11-08 Thread via GitHub


sebastian-nagel commented on PR #796:
URL: https://github.com/apache/nutch/pull/796#issuecomment-1802531264

   Thanks, @jnioche!





Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]

2023-11-08 Thread via GitHub


sebastian-nagel merged PR #796:
URL: https://github.com/apache/nutch/pull/796





Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]

2023-11-08 Thread via GitHub


jnioche commented on PR #796:
URL: https://github.com/apache/nutch/pull/796#issuecomment-1801938355

   @sebastian-nagel, I merged the changes from master and made a few improvements.





Re: [PR] [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 [nutch]

2023-11-08 Thread via GitHub


sebastian-nagel commented on PR #793:
URL: https://github.com/apache/nutch/pull/793#issuecomment-1801814549

   Thanks, @jnioche!
   
   Merged into master, adding the lines to make use of Hadoop-provided 
compression codecs.
   
   Successfully tested in local and pseudo-distributed mode with various codecs 
(gzip / .gz, bzip2, ZStandard / .zst).
   
   One final note: if the fast-urlfilter is not found, the Nutch job (local 
mode) or the tasks (distributed mode) fail with an exception. I didn't change 
this behavior.





Re: [PR] [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 [nutch]

2023-11-08 Thread via GitHub


sebastian-nagel closed pull request #793: [NUTCH-3017] Allow fast-urlfilter to 
load from HDFS/S3 
URL: https://github.com/apache/nutch/pull/793





Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]

2023-11-07 Thread via GitHub


jnioche commented on PR #796:
URL: https://github.com/apache/nutch/pull/796#issuecomment-1798221743

   Writing a test for this thing is an absolute pain. The way the filters are 
used for real is that their method setConf is called and the rules are loaded 
using _getConfResourceAsReader_, i.e. they are expected to be in the jar.
   The tests do not rely on that mechanism and instead instantiate the filter 
with the reader for its rules. This means that the conf is not used at all, so 
we can't use it to load the values for the length-based filters. I will add 
another constructor taking the reader + conf so that we can test the length 
limits.





Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]

2023-11-07 Thread via GitHub


jnioche commented on code in PR #796:
URL: https://github.com/apache/nutch/pull/796#discussion_r1384621727


##
src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java:
##
@@ -97,9 +97,17 @@ public class FastURLFilter implements URLFilter {
 
   private Configuration conf;
   public static final String URLFILTER_FAST_FILE = "urlfilter.fast.file";
+  public static final String URLFILTER_FAST_PATH_MAX_LENGTH = "urlfilter.fast.url.path.max.length";
+  public static final String URLFILTER_FAST_QUERY_MAX_LENGTH = "urlfilter.fast.url.query.max.length";
+  

Review Comment:
   I might keep things simple and just add a size limit on the whole URL 
regardless of its parts, similar to [what is done in 
StormCrawler.](https://github.com/DigitalPebble/storm-crawler/blob/ef31e509139cccb2919c345ef343c4fcfb2f1ec5/core/src/main/java/com/digitalpebble/stormcrawler/filtering/basic/BasicURLFilter.java#L30C17-L30C26)
 






Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]

2023-11-07 Thread via GitHub


sebastian-nagel commented on code in PR #796:
URL: https://github.com/apache/nutch/pull/796#discussion_r1384536930


##
src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java:
##
@@ -97,9 +97,17 @@ public class FastURLFilter implements URLFilter {
 
   private Configuration conf;
   public static final String URLFILTER_FAST_FILE = "urlfilter.fast.file";
+  public static final String URLFILTER_FAST_PATH_MAX_LENGTH = "urlfilter.fast.url.path.max.length";
+  public static final String URLFILTER_FAST_QUERY_MAX_LENGTH = "urlfilter.fast.url.query.max.length";
+  

Review Comment:
   What about adding a third limit for path and query combined 
([URL.getFile()](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/net/URL.html#getFile()))?
   - if somebody defines two generous but reasonable limits (for example, 2048) 
for both path and query, the resulting URL may still get quite long and cause 
trouble
   - also, the HTTP GET request includes both path and query
   - for many use cases it should be sufficient to set just this limit
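
A minimal sketch of how the three limits discussed here could fit together, using only `java.net.URL` from the JDK. The class name `UrlLengthLimits`, its constructor parameters, and the convention that a negative value disables a limit are assumptions made for illustration; this is not the code from the PR.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical helper (not part of Nutch): length limits on URL parts. */
final class UrlLengthLimits {
  private final int maxPath;  // limit on URL.getPath(), negative = disabled
  private final int maxQuery; // limit on URL.getQuery(), negative = disabled
  private final int maxFile;  // limit on URL.getFile() (path + "?" + query)

  UrlLengthLimits(int maxPath, int maxQuery, int maxFile) {
    this.maxPath = maxPath;
    this.maxQuery = maxQuery;
    this.maxFile = maxFile;
  }

  /** Returns true if the URL parses and stays within all enabled limits. */
  boolean accept(String urlString) {
    URL url;
    try {
      url = new URL(urlString);
    } catch (MalformedURLException e) {
      return false; // assumption: unparsable URLs are rejected
    }
    String query = url.getQuery(); // null when the URL has no query part
    if (maxPath >= 0 && url.getPath().length() > maxPath) {
      return false;
    }
    if (maxQuery >= 0 && query != null && query.length() > maxQuery) {
      return false;
    }
    // The combined check catches URLs where each part alone is within bounds.
    return maxFile < 0 || url.getFile().length() <= maxFile;
  }

  public static void main(String[] args) {
    UrlLengthLimits limits = new UrlLengthLimits(2048, 2048, 3000);
    System.out.println(limits.accept("https://example.org/a/b?x=1"));
  }
}
```

With only the third argument set (the other two negative), the sketch behaves like a single limit on what is sent in the HTTP GET request line, which is the simple configuration suggested in the last bullet.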






Re: [PR] NUTCH-3020 -- ParseSegment should check for okhttp's truncation flag [nutch]

2023-11-06 Thread via GitHub


tballison merged PR #794:
URL: https://github.com/apache/nutch/pull/794


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] NUTCH-3019 -- update Tika to 2.9.1 [nutch]

2023-11-06 Thread via GitHub


tballison merged PR #797:
URL: https://github.com/apache/nutch/pull/797





Re: [PR] NUTCH-3019 -- update Tika to 2.9.1 [nutch]

2023-11-06 Thread via GitHub


tballison commented on PR #797:
URL: https://github.com/apache/nutch/pull/797#issuecomment-1795161171

   ```
   2023-11-06T15:02:47.9408964Z [junit] Tests run: 14, Failures: 2, Errors: 0, Skipped: 4, Time elapsed: 4.342 sec
   2023-11-06T15:02:48.2192793Z [junit] Test org.apache.nutch.protocol.okhttp.TestBadServerResponses FAILED
   ```





Re: [PR] NUTCH-3019 -- update Tika to 2.9.1 [nutch]

2023-11-06 Thread via GitHub


tballison commented on PR #797:
URL: https://github.com/apache/nutch/pull/797#issuecomment-1794934171

   Need to keep as draft until the 2.9.1.0 shim actually lands in maven central.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] NUTCH-3019 -- update Tika to 2.9.1 [nutch]

2023-11-06 Thread via GitHub


tballison opened a new pull request, #797:
URL: https://github.com/apache/nutch/pull/797






[PR] NUTCH-3024 Remove flaky 'dependency check' target [nutch]

2023-11-03 Thread via GitHub


lewismc opened a new pull request, #795:
URL: https://github.com/apache/nutch/pull/795

   Addresses https://issues.apache.org/jira/browse/NUTCH-3024





Re: [PR] NUTCH-3014 Standardize Job names [nutch]

2023-11-02 Thread via GitHub


lewismc merged PR #789:
URL: https://github.com/apache/nutch/pull/789





Re: [PR] NUTCH-3014 Standardize Job names [nutch]

2023-11-02 Thread via GitHub


lewismc commented on code in PR #789:
URL: https://github.com/apache/nutch/pull/789#discussion_r138646


##
src/java/org/apache/nutch/crawl/CrawlDbReader.java:
##
@@ -812,7 +811,7 @@ public CrawlDatum get(String crawlDb, String url, Configuration config)
 
   @Override
   protected int process(String line, StringBuilder output) throws Exception {
-Job job = NutchJob.getInstance(getConf());
+Job job = Job.getInstance(getConf(), "Nutch CrawlDbReader: process " + this.crawlDb);

Review Comment:
   Thanks @sebastian-nagel 






Re: [PR] NUTCH-3020 -- ParseSegment should check for okhttp's truncation flag [nutch]

2023-11-01 Thread via GitHub


lewismc commented on PR #794:
URL: https://github.com/apache/nutch/pull/794#issuecomment-1789810071

   We have no tests for `ParseSegment` right now. I think it would be excellent 
if this PR could include a test for `ParseSegment.isTruncated`.





[PR] NUTCH-3020 -- ParseSegment should check for okhttp's truncation flag [nutch]

2023-11-01 Thread via GitHub


tballison opened a new pull request, #794:
URL: https://github.com/apache/nutch/pull/794






Re: [PR] [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 [nutch]

2023-10-31 Thread via GitHub


sebastian-nagel commented on code in PR #793:
URL: https://github.com/apache/nutch/pull/793#discussion_r1377375552


##
src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java:
##
@@ -181,9 +186,23 @@ public String filter(String url) {
 
   public void reloadRules() throws IOException {
 String fileRules = conf.get(URLFILTER_FAST_FILE);
-try (Reader reader = conf.getConfResourceAsReader(fileRules)) {
-  reloadRules(reader);
+
+InputStream is;
+
+Path fileRulesPath = new Path(fileRules);
+if (fileRulesPath.toUri().getScheme() != null) {
+  FileSystem fs = fileRulesPath.getFileSystem(conf);
+  is = fs.open(fileRulesPath);
+}

Review Comment:
   Since we have Hadoop, we could try all supported compression codecs (gzip, 
bzip2, zstd, etc.). Something like this (not tested):
   ```java
   CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(fileRulesPath);
   if (codec != null) {
     is = codec.createInputStream(is);
   }
   ```
   See [cf.getCodec(...)](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/io/compress/CompressionCodecFactory.html#getCodec-org.apache.hadoop.fs.Path-) 
and [codec.createInputStream(...)](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/io/compress/CompressionCodec.html#createInputStream-java.io.InputStream-).
   
   If the rules file is contained in the job jar, it shouldn't be compressed 
anyway.
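
For readers without a Hadoop setup at hand, the idea of picking a decompressor from the file name suffix can be tried with the JDK alone. The sketch below is only an analogue under stated assumptions: it handles just `.gz` via `java.util.zip`, whereas Hadoop's `CompressionCodecFactory` covers whatever codecs are configured. The class `RulesStreamSketch` and its methods are hypothetical names, not Nutch or Hadoop API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical JDK-only analogue of suffix-based codec selection. */
final class RulesStreamSketch {

  /** Wraps the raw stream in a decompressor chosen by file name suffix. */
  static InputStream maybeDecompress(String fileName, InputStream raw)
      throws IOException {
    if (fileName.endsWith(".gz")) {
      return new GZIPInputStream(raw);
    }
    return raw; // unknown suffix: assume the content is not compressed
  }

  /** Reads the (possibly compressed) rules content as UTF-8 text. */
  static String readRules(String fileName, byte[] content) {
    try (InputStream in =
        maybeDecompress(fileName, new ByteArrayInputStream(content))) {
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Demo helper: gzip-compresses a string in memory. */
  static byte[] gzip(String text) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
      out.write(text.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  public static void main(String[] args) {
    String rules = "# sample rules file content\n";
    // Compressed and plain inputs should yield the same text.
    System.out.println(readRules("rules.txt.gz", gzip(rules)).equals(rules));
    System.out.println(
        readRules("rules.txt", rules.getBytes(StandardCharsets.UTF_8))
            .equals(rules));
  }
}
```

As in the suggestion above, a file without a recognized suffix is passed through unchanged, which matches the case of an uncompressed rules file shipped in the job jar.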





