[jira] [Updated] (NUTCH-3087) Nutch crawling inconsistent on URLs with userinfo

2024-12-04 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3087:
---
Fix Version/s: 1.21

> Nutch crawling inconsistent on URLs with userinfo
> -
>
> Key: NUTCH-3087
> URL: https://issues.apache.org/jira/browse/NUTCH-3087
> Project: Nutch
>  Issue Type: Bug
>  Components: urlnormalizer
>Affects Versions: 1.21
>Reporter: Hiran Chaudhuri
>Priority: Major
> Fix For: 1.21
>
>
> I am trying to scan the URL
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
> Note the userinfo 'hiran', which is used for authentication on the server. 
> (The smb plugin pulls credentials from another configuration file, but this 
> is irrelevant here).
> The URL is fetched, parsed, updated in the crawldb and sent to the indexer. 
> So far so good. But the outlinks that are detected are of different quality: 
> some have the userinfo preserved, some are missing that information.
> Dumping the segment, I can see the data below. Note that some of the outlinks 
> start with smb://hi...@nas.fritz.box, while others start with 
> smb://nas.fritz.box. The impact is that on the next fetch run, authentication 
> information is missing and those URLs cannot be fetched further.
>  
> {code:java}
> Recno:: 0
> URL:: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
> CrawlDatum::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Tue Oct 29 22:56:58 CET 2024
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 86400 seconds (1 days)
> Score: 1.0
> Signature: null
> Metadata: 
>      _ngt_=1730239026566
> Content::
> Version: -1
> url: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
> base: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/.
> contentType: text/html
> metadata: nutch.segment.name=20241029225708 _fst_=33 nutch.crawl.score=1.0
> Content:
> Index of 
> /Documents/Hiran/Monitoring/Index of 
> /Documents/Hiran/Monitoring/.svn/    Tue Oct 24 
> 13:32:32 CEST 2017
> architektur.dia    Mon Feb 22 21:30:33 CET 2010
> architektur.dia~    Mon Feb 22 21:20:42 CET 
> 2010
> architektur.png    Mon Feb 22 21:34:27 CET 2010
> deployment.dia    Mon Feb 22 22:56:15 CET 2010
> deployment.dia~    Mon Feb 22 22:51:21 CET 
> 2010
> deployment.png    Mon Feb 22 23:00:34 CET 2010
> Monitoring strategy.odt    Fri Aug 01 
> 13:38:04 CEST 2014
> 
> ParseData::
> Version: 5
> Status: success(1,0)
> Title: Index of /Documents/Hiran/Monitoring/
> Outlinks: 5
>   outlink: toUrl: 
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/architektur.dia anchor: 
> architektur.dia Mon Feb 22 21:30:33 CET 2010
>   outlink: toUrl: 
> smb://nas.fritz.box/Documents/Hiran/Monitoring/architektur.dia~ anchor: 
> architektur.dia~ Mon Feb 22 21:20:42 CET 2010
>   outlink: toUrl: 
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/deployment.dia anchor: 
> deployment.dia Mon Feb 22 22:56:15 CET 2010
>   outlink: toUrl: 
> smb://nas.fritz.box/Documents/Hiran/Monitoring/deployment.dia~ anchor: 
> deployment.dia~ Mon Feb 22 22:51:21 CET 2010
>   outlink: toUrl: 
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/Monitoring+strategy.odt 
> anchor: Monitoring strategy.odt Fri Aug 01 13:38:04 CEST 2014
> Content Metadata:
>   nutch.segment.name = 20241029225708
>   nutch.content.digest = a794c6675cb2f9e460e7771060ed2dfc
>   _fst_ = 33
>   nutch.crawl.score = 1.0
> Parse Metadata:
>   CharEncodingForConversion = windows-1252
>   OriginalCharEncoding = windows-1252
>   language = en
> CrawlDatum::
> Version: 7
> Status: 65 (signature)
> Fetch time: Tue Oct 29 22:57:25 CET 2024
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 0 seconds (0 days)
> Score: 0.0
> Signature: a794c6675cb2f9e460e7771060ed2dfc
> Metadata: 
>  
> CrawlDatum::
> Version: 7
> Status: 33 (fetch_success)
> Fetch time: Tue Oct 29 22:57:17 CET 2024
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 86400 seconds (1 days)
> Score: 1.0
> Signature: null
> Metadata: 
>      _ngt_=1730239026566
>     _pst_=success(1), lastModified=0
>     Content-Type=text/html
> ParseText::
> Index of /Documents/Hiran/Monitoring/
> Index of /Documents/Hiran/Monitoring/
> .svn/ Tue Oct 24 13:32:32 CEST 2017
> architektur.dia Mon Feb 22 21:30:33 CET 2010
> architektur.dia~ Mon Feb 22 21:20:42 CET 2010
> architektur.png Mon Feb 22 21:34:27 CET 2
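For reference, RFC 3986 resolution keeps the userinfo as part of the URL's authority, so consistent outlink resolution against the page's base URL should preserve it. A minimal illustration with java.net.URI (the class name is hypothetical, not Nutch code):

```java
import java.net.URI;

/**
 * Minimal illustration (hypothetical class, not Nutch code):
 * RFC 3986 reference resolution keeps the userinfo as part of the
 * authority, so resolving outlinks against the page's base URL
 * should preserve it.
 */
public class UserinfoResolve {
  public static String resolve(String base, String link) {
    return URI.create(base).resolve(link).toString();
  }

  public static void main(String[] args) {
    // The userinfo ("hiran") survives resolution of the relative link
    System.out.println(resolve(
        "smb://hiran@nas.fritz.box/Documents/Hiran/Monitoring/",
        "architektur.dia"));
  }
}
```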

Re: [VOTE] Create public archive of nutch-webapp

2024-12-04 Thread Sebastian Nagel

+1

(sorry for the late vote)

On 10/16/24 19:25, Tim Allison wrote:

+1

Thank you, Lewis!

On Wed, Oct 16, 2024 at 10:45 AM lewis john mcgibbney
 wrote:


Hi dev@,

I was recently encouraged to look at the nutch-webapp [0] repository after a 
number of years. It hasn't been touched in 3 years, and no official release 
artifact(s) have been published since we spun it out of the core Nutch codebase 
[1]. In addition, a growing collection of Dependabot PRs [2] and security 
alerts [3] exists. I know that the WebApp is feature incomplete. Although we've 
made it easier to consume via the Nutch Dockerfile [4], this tends to leave 
users confused about its maturity, as they quickly experience 
runtime/operational issues and need to seek assistance/guidance.

I would therefore like to open a VOTE to retire the nutch-webapp repository 
[0]. The VOTE will be open for at least 72 hours.

[ ] +1 retire the nutch-webapp repository
[ ] +/-0 retire the nutch-webapp repository
[ ] -1 DO NOT retire the nutch-webapp repository (please explain your VOTE)

Thanks in advance to anyone who VOTEs.
lewismc

P.S. Here's my +1

[0] https://github.com/apache/nutch-webapp
[1] https://issues.apache.org/jira/browse/NUTCH-2886
[2] https://github.com/apache/nutch-webapp/pulls
[3] https://github.com/apache/nutch-webapp/security/dependabot
[4] https://github.com/apache/nutch/blob/master/docker/Dockerfile#L87-L102

--
http://people.apache.org/keys/committer/lewismc




[jira] [Resolved] (NUTCH-3079) Dumping a segment fails unless it has been fetched and parsed

2024-12-04 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3079.

  Assignee: Hiran Chaudhuri
Resolution: Fixed

Fixed in PR [#837|https://github.com/apache/nutch/pull/837]. Thanks, 
[~hiranchaudhuri]!

> Dumping a segment fails unless it has been fetched and parsed
> -
>
> Key: NUTCH-3079
> URL: https://issues.apache.org/jira/browse/NUTCH-3079
> Project: Nutch
>  Issue Type: Bug
> Environment: Ubuntu 22 LTS
> $ $JAVA_HOME/bin/java -version
> openjdk version "21.0.4" 2024-07-16 LTS
> OpenJDK Runtime Environment Temurin-21.0.4+7 (build 21.0.4+7-LTS)
> OpenJDK 64-Bit Server VM Temurin-21.0.4+7 (build 21.0.4+7-LTS, mixed mode, 
> sharing)
>Reporter: Hiran Chaudhuri
>Assignee: Hiran Chaudhuri
>Priority: Major
> Fix For: 1.21
>
>
> On some existing crawldb generate a new segment:
> {{./local/bin/nutch generate crawl/crawldb crawl/segments}}
> {{...}}
> {{2024-10-14 07:58:58,589 INFO org.apache.nutch.crawl.Generator [main] 
> Generator: segment: crawl/segments/20241014075858}}
> {{2024-10-14 07:58:59,731 INFO org.apache.nutch.crawl.Generator [main] 
> Generator: finished, elapsed: 3423 ms}}
> Then try to dump this new segment:
> {{./local/bin/nutch readseg -dump crawl/segments/20241014075858 
> crawl/log/dumpsegment}}
> {{This errors out with}}
> {{2024-10-14 08:01:10,448 INFO org.apache.nutch.segment.SegmentReader [main] 
> SegmentReader: dump segment: crawl/segments/20241014075858}}
> {{2024-10-14 08:01:10,705 ERROR org.apache.nutch.segment.SegmentReader [main] 
> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
> not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_fetch}}
> {{Input path does not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_parse}}
> {{Input path does not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/content}}
> {{Input path does not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/parse_data}}
> {{Input path does not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/parse_text}}
> {{    at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:340)}}
> {{    at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:279)}}
> {{    at 
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)}}
> {{    at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:404)}}
> {{    at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:310)}}
> {{    at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:327)}}
> {{    at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)}}
> {{    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1678)}}
> {{    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1675)}}
> {{    at 
> java.base/java.security.AccessController.doPrivileged(AccessController.java:714)}}
> {{    at java.base/javax.security.auth.Subject.doAs(Subject.java:525)}}
> {{    at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)}}
> {{    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1675)}}
> {{    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1696)}}
> {{    at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:238)}}
> {{    at org.apache.nutch.segment.SegmentReader.run(SegmentReader.java:677)}}
> {{    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
> {{    at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:765)}}
> {{Caused by: java.io.IOException: Input path does not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_fetch}}
> {{    at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:313)}}
> {{    ... 17 more}}
> {{Exception in thread "main" 
> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
> not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_fetch}}
> {{Input path does not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_p
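The underlying problem is that a generate-only segment contains just the crawl_generate subdirectory. A sketch of the kind of existence check the dump needs (illustrative names, not the actual code from PR #837):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the kind of guard the segment dump needs (illustrative
 * names, not the code from PR #837): only segment subdirectories
 * that actually exist should be added as job input paths.
 */
public class SegmentDirs {
  static final String[] PARTS = {
      "crawl_generate", "crawl_fetch", "crawl_parse",
      "content", "parse_data", "parse_text" };

  public static List<String> existingParts(File segment) {
    List<String> found = new ArrayList<>();
    for (String part : PARTS) {
      // a generate-only segment contains just crawl_generate
      if (new File(segment, part).exists()) {
        found.add(part);
      }
    }
    return found;
  }
}
```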

[jira] [Resolved] (NUTCH-3083) Add RobotRulesParser to bin/nutch

2024-12-04 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3083.

Resolution: Implemented

> Add RobotRulesParser to bin/nutch
> -
>
> Key: NUTCH-3083
> URL: https://issues.apache.org/jira/browse/NUTCH-3083
> Project: Nutch
>  Issue Type: Improvement
>  Components: bin
>Affects Versions: 1.21
>    Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.21
>
>
> The main method of the class {{org.apache.nutch.protocol.RobotRulesParser}} 
> is quite useful for verifying whether and how robots.txt files are 
> parsed. It should be added to bin/nutch as *robotsparser*, similar to 
> "parsechecker", "filterchecker", etc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3091) Allow URL filters to flag an existing URL to delete from index

2024-12-04 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17903045#comment-17903045
 ] 

Sebastian Nagel commented on NUTCH-3091:


Hi [~marcos], thanks for the contribution!

I've tested the patch and got a NPE:
{noformat}
java.lang.Exception: java.lang.NullPointerException
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492) 
~[hadoop-mapreduce-client-common-3.3.6.jar:?]
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:552) 
[hadoop-mapreduce-client-common-3.3.6.jar:?]
Caused by: java.lang.NullPointerException
at org.apache.hadoop.io.Text.encode(Text.java:497) 
~[hadoop-common-3.3.6.jar:?]
at org.apache.hadoop.io.Text.set(Text.java:212) 
~[hadoop-common-3.3.6.jar:?]
at 
org.apache.nutch.indexer.IndexerMapReduce$IndexerMapper.map(IndexerMapReduce.java:194)
 ~[apache-nutch-1.21-SNAPSHOT.jar:?]
at 
org.apache.nutch.indexer.IndexerMapReduce$IndexerMapper.map(IndexerMapReduce.java:155)
 ~[apache-nutch-1.21-SNAPSHOT.jar:?]
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146) 
~[hadoop-mapreduce-client-core-3.3.6.jar:?]
{noformat}
{code:java}
if (urlString == null && !filterDelete) {
return;
} else {
key.set(urlString); // << NPE thrown here if URL is filtered (is null) and 
filterDelete == true
}
{code}
Could you fix this NPE? I'd also strongly recommend testing the updated patch 
beforehand. You do not need to set up Solr: there's indexer-dummy, which makes 
it very easy to verify whether index additions or deletions are the expected 
ones. I've just run it on some test data:
{code:bash}
bin/nutch index -Dindexer.delete.by.url.filters=true 
-Dplugin.includes='indexer-dummy|indexing-basic|urlfilter-regex' crawldb -dir 
segments
{code}

Further remarks:
- the indentation does not correspond to our [code-formatting 
template|https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml]. 
Could you apply the template? Otherwise let us know and we can do it before 
committing the patch. Thanks!
- some minimal documentation is required: at least, the new property needs to 
be described in conf/nutch-default.xml with a default value (false)
- possibly, this option could also be added as a command-line 
argument to the "index" job (IndexingJob.java)
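A null-safe variant of the guard quoted above could look like the following (a sketch under assumed semantics, not the committed patch; when the URL is filtered away, the deletion would have to be keyed by the original URL, which must be kept before filtering):

```java
/**
 * Null-safe variant of the quoted guard (a sketch, not the committed
 * patch): returns the key to emit, or null to skip the record,
 * instead of ever calling Text.set(null).
 */
public class FilterGuard {
  public static String keyFor(String filteredUrl, String originalUrl,
      boolean filterDelete) {
    if (filteredUrl != null) {
      return filteredUrl;   // URL passed the filters: index it
    }
    // URL was rejected by the filters (filteredUrl == null)
    return filterDelete
        ? originalUrl       // emit a deletion under the original URL
        : null;             // no deletion requested: skip the record
  }
}
```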

> Allow URL filters to flag an existing URL to delete from index
> --
>
> Key: NUTCH-3091
> URL: https://issues.apache.org/jira/browse/NUTCH-3091
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer, urlfilter
>Affects Versions: 1.20
>Reporter: Marcos Gomez
>Priority: Major
> Attachments: patch_delete_by_url_filter.patch
>
>
> When the crawldb already contains URLs that are rejected after the 
> configuration of one of the URLFilter plugins has been updated, those URLs are 
> filtered in the index phase, but they are not removed from the index as is 
> done with ‘gone’ pages or ‘redirects’.
> Currently there is a ‘-filter’ flag that prevents these URLs from being 
> processed, but they are not removed from the index; it should be possible to 
> add a new option or parameter for this.
>  





[jira] [Commented] (NUTCH-3096) HostDB ResolverThread can create too many job counters

2024-12-04 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17902987#comment-17902987
 ] 

Sebastian Nagel commented on NUTCH-3096:


+1 tested successfully

- there is a missing semicolon!
- shall I commit the patch?

> HostDB ResolverThread can create too many job counters
> --
>
> Key: NUTCH-3096
> URL: https://issues.apache.org/jira/browse/NUTCH-3096
> Project: Nutch
>  Issue Type: Bug
>  Components: hostdb
>Affects Versions: 1.20
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.21
>
> Attachments: NUTCH-3096-1.15.patch, NUTCH-3096-1.patch, 
> NUTCH-3096.patch
>
>
> Hadoop will allow no more than 120 distinct counters. If we have a large 
> number of distinct DNS lookup failure counts, we'll exceed the limit, Hadoop 
> will complain, and the job will fail.
>  
> Let's limit the number of possibilities by grouping the numFailures into just a 
> few buckets.
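The bucketing described above can be sketched as follows (an illustrative sketch with made-up bucket boundaries, not the attached patch):

```java
/**
 * Illustrative sketch (not the attached patch): group exact failure
 * counts into a handful of buckets so the job creates only a few
 * distinct Hadoop counters instead of one per exact count.
 */
public class FailureBuckets {
  public static String bucket(int numFailures) {
    if (numFailures == 0)  return "0";
    if (numFailures <= 3)  return "1-3";
    if (numFailures <= 10) return "4-10";
    return ">10";
  }
}
```

The counter name would then be built from the bucket, e.g. `"failures_" + bucket(n)`, capping the counter cardinality at four regardless of how many distinct failure counts occur.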





[jira] [Resolved] (NUTCH-3096) HostDB ResolverThread can create too many job counters

2024-12-04 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3096.

Resolution: Fixed

Committed in 
[5263b7c|https://github.com/apache/nutch/commit/5263b7cbea0a50bf0bb3324f139f2ad3030f6875].

> HostDB ResolverThread can create too many job counters
> --
>
> Key: NUTCH-3096
> URL: https://issues.apache.org/jira/browse/NUTCH-3096
> Project: Nutch
>  Issue Type: Bug
>  Components: hostdb
>Affects Versions: 1.20
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.21
>
> Attachments: NUTCH-3096-1.15.patch, NUTCH-3096-1.patch, 
> NUTCH-3096.patch
>
>
> Hadoop will allow no more than 120 distinct counters. If we have a large 
> number of distinct DNS lookup failure counts, we'll exceed the limit, Hadoop 
> will complain, and the job will fail.
>  
> Let's limit the number of possibilities by grouping the numFailures into just a 
> few buckets.





[jira] [Commented] (NUTCH-3096) HostDB ResolverThread can create too many job counters

2024-12-04 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17903023#comment-17903023
 ] 

Sebastian Nagel commented on NUTCH-3096:


> how did that compile?!?

I've added the missing semicolon. The fix was obvious!

I'm going to commit the patch. Thanks, [~markus17]!

> HostDB ResolverThread can create too many job counters
> --
>
> Key: NUTCH-3096
> URL: https://issues.apache.org/jira/browse/NUTCH-3096
> Project: Nutch
>  Issue Type: Bug
>  Components: hostdb
>Affects Versions: 1.20
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.21
>
> Attachments: NUTCH-3096-1.15.patch, NUTCH-3096-1.patch, 
> NUTCH-3096.patch
>
>
> Hadoop will allow no more than 120 distinct counters. If we have a large 
> number of distinct DNS lookup failure counts, we'll exceed the limit, Hadoop 
> will complain, and the job will fail.
>  
> Let's limit the number of possibilities by grouping the numFailures into just a 
> few buckets.





[jira] [Commented] (NUTCH-3097) Plugin indexer-elastic throws ClassNotFoundException due to invalid dependencies

2024-12-03 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17902717#comment-17902717
 ] 

Sebastian Nagel commented on NUTCH-3097:


+1 lgtm.

[~mpuzianowski], have you run the Nutch index job in distributed mode (on a 
Hadoop cluster)?

In distributed mode, it might be dangerous to suppress dependencies which are 
provided by Nutch core, because the class path is set in parts by the YARN 
job/task configuration. Good catch and thanks!

> Plugin indexer-elastic throws ClassNotFoundException due to invalid 
> dependencies
> 
>
> Key: NUTCH-3097
> URL: https://issues.apache.org/jira/browse/NUTCH-3097
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.20
>Reporter: Maciej Puzianowski
>Priority: Major
>  Labels: easyfix
>
> In Apache Nutch 1.20, when using indexer-elastic plugin, IndexerJob throws a 
> ClassNotFoundException:
> {code:java}
> Error: java.lang.ClassNotFoundException: org.apache.logging.log4j.Level
>         at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
>         at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
>         at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:527)
>         at 
> org.apache.nutch.plugin.PluginClassLoader.loadClassFromSystem(PluginClassLoader.java:105)
>         at 
> org.apache.nutch.plugin.PluginClassLoader.loadClassFromParent(PluginClassLoader.java:93)
>         at 
> org.apache.nutch.plugin.PluginClassLoader.loadClass(PluginClassLoader.java:73)
>         at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:527)
>         at 
> org.elasticsearch.common.logging.DeprecationLogger.(DeprecationLogger.java:45)
>         at 
> org.elasticsearch.common.util.concurrent.EsExecutors.(EsExecutors.java:49)
>         at 
> org.elasticsearch.threadpool.Scheduler.initScheduler(Scheduler.java:56)
>         at 
> org.elasticsearch.action.bulk.BulkProcessor.builder(BulkProcessor.java:238)
>         at 
> org.apache.nutch.indexwriter.elastic.ElasticIndexWriter.open(ElasticIndexWriter.java:149)
>         at org.apache.nutch.indexer.IndexWriters.open(IndexWriters.java:216)
>         at 
> org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:44)
>         at 
> org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.(ReduceTask.java:542)
>         at 
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:615)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
>         at java.base/java.security.AccessController.doPrivileged(Native 
> Method)
>         at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172) {code}
> I have found a solution that I would like to commit.





[jira] [Commented] (NUTCH-3094) Github tests to run if build configuration changes

2024-12-03 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17902628#comment-17902628
 ] 

Sebastian Nagel commented on NUTCH-3094:


Ok. There was a typo in one of the branches, writing "plugin" instead of 
"plugins". :D Fixed now.

> Github tests to run if build configuration changes
> --
>
> Key: NUTCH-3094
> URL: https://issues.apache.org/jira/browse/NUTCH-3094
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.21
>    Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> If the build configuration changes, the unit tests should be run. Seen 
> together with NUTCH-3093 and the workflow: 
> https://github.com/apache/nutch/actions/runs/12048862236/job/33594395141?pr=840





[jira] [Resolved] (NUTCH-3094) Github tests to run if build configuration changes

2024-12-03 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3094.

Resolution: Fixed

> Github tests to run if build configuration changes
> --
>
> Key: NUTCH-3094
> URL: https://issues.apache.org/jira/browse/NUTCH-3094
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.21
>    Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> If the build configuration changes, the unit tests should be run. Seen 
> together with NUTCH-3093 and the workflow: 
> https://github.com/apache/nutch/actions/runs/12048862236/job/33594395141?pr=840





[jira] [Commented] (NUTCH-3094) Github tests to run if build configuration changes

2024-12-03 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17902607#comment-17902607
 ] 

Sebastian Nagel commented on NUTCH-3094:


Reopened because Java code changes in both core and plugins didn't trigger the 
unit tests, see 
https://github.com/apache/nutch/pull/839#issuecomment-2514340781.

> Github tests to run if build configuration changes
> --
>
> Key: NUTCH-3094
> URL: https://issues.apache.org/jira/browse/NUTCH-3094
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.21
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> If the build configuration changes, the unit tests should be run. Seen 
> together with NUTCH-3093 and the workflow: 
> https://github.com/apache/nutch/actions/runs/12048862236/job/33594395141?pr=840





[jira] [Reopened] (NUTCH-3094) Github tests to run if build configuration changes

2024-12-03 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reopened NUTCH-3094:

  Assignee: Sebastian Nagel

> Github tests to run if build configuration changes
> --
>
> Key: NUTCH-3094
> URL: https://issues.apache.org/jira/browse/NUTCH-3094
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.21
>    Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> If the build configuration changes, the unit tests should be run. Seen 
> together with NUTCH-3093 and the workflow: 
> https://github.com/apache/nutch/actions/runs/12048862236/job/33594395141?pr=840





[jira] [Resolved] (NUTCH-3094) Github tests to run if build configuration changes

2024-12-03 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3094.

Resolution: Fixed

> Github tests to run if build configuration changes
> --
>
> Key: NUTCH-3094
> URL: https://issues.apache.org/jira/browse/NUTCH-3094
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.21
>    Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> If the build configuration changes, the unit tests should be run. Seen 
> together with NUTCH-3093 and the workflow: 
> https://github.com/apache/nutch/actions/runs/12048862236/job/33594395141?pr=840





[jira] [Commented] (NUTCH-3096) HostDB ResolverThread can create too many job counters

2024-12-03 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17902588#comment-17902588
 ] 

Sebastian Nagel commented on NUTCH-3096:


+1 lgtm.

You could move the bucketing code into a function / method. The 
ResolverThread.run() method contains a try-catch block with most of the code in 
the catch part, which is not easy to read.

> HostDB ResolverThread can create too many job counters
> --
>
> Key: NUTCH-3096
> URL: https://issues.apache.org/jira/browse/NUTCH-3096
> Project: Nutch
>  Issue Type: Bug
>  Components: hostdb
>Affects Versions: 1.20
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.21
>
> Attachments: NUTCH-3096-1.15.patch, NUTCH-3096.patch
>
>
> Hadoop will allow no more than 120 distinct counters. If we have a large 
> number of distinct DNS lookup failure counts, we'll exceed the limit, Hadoop 
> will complain, and the job will fail.
>  
> Let's limit the number of possibilities by grouping the numFailures into just a 
> few buckets.





[jira] [Resolved] (NUTCH-3095) Update .gitignore to ignore Hadoop native libraries

2024-12-03 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3095.

Resolution: Fixed

> Update .gitignore to ignore Hadoop native libraries
> ---
>
> Key: NUTCH-3095
> URL: https://issues.apache.org/jira/browse/NUTCH-3095
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.20
>    Reporter: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.21
>
>
> Hadoop native libraries can be installed into lib/native/ to use them also in 
> local mode, see the README.txt there. If they are installed, they should be 
> ignored by git.





[jira] [Created] (NUTCH-3095) Update .gitignore to ignore Hadoop native libraries

2024-11-27 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3095:
--

 Summary: Update .gitignore to ignore Hadoop native libraries
 Key: NUTCH-3095
 URL: https://issues.apache.org/jira/browse/NUTCH-3095
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.20
Reporter: Sebastian Nagel
 Fix For: 1.21


Hadoop native libraries can be installed into lib/native/ to use them also in 
local mode, see the README.txt there. If they are installed, they should be 
ignored by git.





[jira] [Created] (NUTCH-3093) Ant target test-plugins to depend on compile-core-test

2024-11-27 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3093:
--

 Summary: Ant target test-plugins to depend on compile-core-test
 Key: NUTCH-3093
 URL: https://issues.apache.org/jira/browse/NUTCH-3093
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.21
Reporter: Sebastian Nagel
 Fix For: 1.21


The ant target "test-plugins" must depend on the target "compile-core-test" 
because test classes may import core test classes:
{noformat}
$> ant clean test-plugins
...
[javac] 
.../src/plugin/protocol-okhttp/src/test/org/apache/nutch/protocol/okhttp/TestBadServerResponses.java:29:
 error: cannot find symbol
[javac] import org.apache.nutch.protocol.AbstractHttpProtocolPluginTest;
[javac] ^
[javac]   symbol:   class AbstractHttpProtocolPluginTest
[javac]   location: package org.apache.nutch.protocol
{noformat}

Note: when running {{ant test}} the core test classes are compiled for the 
target "test-core" and stay available for the plugin tests.
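The fix boils down to declaring the dependency in build.xml; a minimal sketch (surrounding attributes and the target body omitted, not the committed change):

```xml
<!-- build.xml sketch: make test-plugins compile core test classes first -->
<target name="test-plugins" depends="compile-core-test">
  <!-- existing body of the target stays unchanged -->
</target>
```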

This issue has shown up in the automated tests on Github after NUTCH-3084 which 
runs conditionally on code changes in core or plugins source folders. 





[jira] [Created] (NUTCH-3094) Github tests to run if build configuration changes

2024-11-27 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3094:
--

 Summary: Github tests to run if build configuration changes
 Key: NUTCH-3094
 URL: https://issues.apache.org/jira/browse/NUTCH-3094
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.21
Reporter: Sebastian Nagel
 Fix For: 1.21


If the build configuration changes, the unit tests should be run. Seen together 
with NUTCH-3093 and the workflow: 
https://github.com/apache/nutch/actions/runs/12048862236/job/33594395141?pr=840





[jira] [Created] (NUTCH-3092) Replace all imports of commons-lang by commons-lang3

2024-11-26 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3092:
--

 Summary: Replace all imports of commons-lang by commons-lang3
 Key: NUTCH-3092
 URL: https://issues.apache.org/jira/browse/NUTCH-3092
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.20
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.21


The dependency on commons-lang was upgraded to commons-lang3 long ago. Since 
then, the commons-lang jar is only provided as a transitive dependency and 
might disappear.

We should upgrade our code to use only commons-lang3, see 
https://commons.apache.org/proper/commons-lang/article3_0.html.
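The code change itself is a one-line import swap per affected class; for example (StringUtils exists under the same class name in both packages):

```diff
-import org.apache.commons.lang.StringUtils;
+import org.apache.commons.lang3.StringUtils;
```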





[jira] [Created] (NUTCH-3090) Plugin for MIME type detection

2024-11-15 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3090:
--

 Summary: Plugin for MIME type detection
 Key: NUTCH-3090
 URL: https://issues.apache.org/jira/browse/NUTCH-3090
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Affects Versions: 1.21
Reporter: Sebastian Nagel
 Fix For: 1.21


(suggested by [~hiranchaudhuri] in NUTCH-3089)

- introduce a new plugin extension point
  -- allow providing (and trying) different MIME detection tools
  -- but we'll start by moving the Tika MIME magic detector from Nutch core 
into a plugin
 --- reduce the Nutch core dependencies
 --- would allow including the [container aware 
detection|https://tika.apache.org/3.0.0/detection.html#Container_Aware_Detection]
 in the plugin without adding tika-parsers-standard and its dependencies 
to the Nutch core dependencies. Cf. NUTCH-3089.
  -- although probably not two of them at the same time, or we'd need to define 
how results are weighted / combined
- provide a simple fall-back (a cleansed HTTP Content-Type header) in case no 
mime-identifier plugin is activated via plugin.includes
- sharing Tika modules between parse-tika, mime-identifier-tika or 
language-identifier is possible if we create a lib-tika plugin - plugins can 
depend on other plugins. It might even be lib-tika-core and lib-tika-parsers, 
or anything else.
- one remark: Content objects are created in protocol plugins as part of the 
ProtocolResponse. That is, we'll call a plugin from within a plugin. But this 
is no problem: the parse filter plugins are likewise called from within parser 
plugins.

Comments are welcome! This idea needs some specification.
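
To make the proposal concrete, the extension point and a "first decisive 
answer wins" combination strategy could be sketched as below. All names 
(MimeIdentifier, MimeIdentifiers) are hypothetical; a real version would be 
registered through Nutch's plugin framework (plugin.xml, ExtensionPoint 
descriptors):

```java
import java.util.List;
import java.util.Optional;

// Hypothetical extension point for mime-identifier plugins.
interface MimeIdentifier {
  /** Return the detected MIME type, or empty if this detector cannot decide. */
  Optional<String> detect(byte[] content, String url, String contentTypeHeader);
}

class MimeIdentifiers {
  private final List<MimeIdentifier> identifiers;

  MimeIdentifiers(List<MimeIdentifier> identifiers) {
    this.identifiers = identifiers;
  }

  /**
   * Simplest combination strategy: ask the detectors in plugin order and
   * take the first decisive answer. If several detectors should be active
   * at once, this is where weighting / combining would be defined.
   */
  String detect(byte[] content, String url, String header, String fallback) {
    for (MimeIdentifier id : identifiers) {
      Optional<String> type = id.detect(content, url, header);
      if (type.isPresent()) {
        return type.get();
      }
    }
    return fallback;
  }

  public static void main(String[] args) {
    // A trivial header-based detector standing in for a real plugin.
    MimeIdentifier headerOnly = (content, url, header) ->
        header == null ? Optional.<String>empty()
                       : Optional.of(header.split(";", 2)[0].trim());
    MimeIdentifiers ids = new MimeIdentifiers(List.of(headerOnly));
    System.out.println(ids.detect(new byte[0], "smb://host/index.html",
        "text/html; charset=utf-8", "application/octet-stream"));
  }
}
```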





[jira] [Commented] (NUTCH-3089) Review MIME type detection

2024-11-15 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17898712#comment-17898712
 ] 

Sebastian Nagel commented on NUTCH-3089:


[~hiranchaudhuri] Thanks! This is actually a good idea. Yes, it requires some 
implementation work and some significant changes. But it might be worth it. 
[~tallison], moving the MIME detection to a plugin would reduce the burden of 
its huge dependency tree. I've opened NUTCH-3090 to discuss this idea. Let's 
keep this issue for how the MIME detection is implemented in Nutch, that is, 
how the Tika methods are called. 

> Review MIME type detection
> --
>
> Key: NUTCH-3089
> URL: https://issues.apache.org/jira/browse/NUTCH-3089
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol, util
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> The MIME detection in {{o.a.n.util.MimeUtil#autoResolveContentType}} needs a 
> review:
> - the fall-back to use the Content-Type HTTP header, only moderately cleaned, 
> leads to strange-looking and obviously misspelled resp. invalid MIME types: 
> "application/.octet-stream", "application/."
>   - note: this issue stems from a [discussion on the Common Crawl user 
> group|https://groups.google.com/g/common-crawl/c/0FANtRcJOts/m/q5KtncIcBgAJ]. 
> More examples are given there.
>   - Tika's method 
> [MimeTypes#forName|https://tika.apache.org/3.0.0/api/org/apache/tika/mime/MimeTypes.html#forName(java.lang.String)]
>  used in [MimeUtil.java, line 
> 162|https://github.com/apache/nutch/blob/e1b8dbe909b0f8c181dcb5ee0e7e072f27f82cbb/src/java/org/apache/nutch/util/MimeUtil.java#L162]
>  does only a limited validation, not sufficient to filter out the above 
> mentioned erroneous MIME types.
> - performance:  if the property {{mime.type.magic}} == true, Tika's magic 
> detector is called with the binary content and the URL (which includes the 
> file suffix) and the Content-Type HTTP header as additional hints to support 
> the detection. Tika's detect method uses similar fall-back heuristics, 
> calling also {{MimeTypes#forName}}. Relying only on Tika's detect method if 
> {{mime.type.magic}} == true, should save computation time, and eventually 
> leads to more precise results.





[jira] [Commented] (NUTCH-3089) Review MIME type detection

2024-11-14 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17898360#comment-17898360
 ] 

Sebastian Nagel commented on NUTCH-3089:


Currently, tika-parsers-standard is the sole dependency of parse-tika. Nutch 
core (where the MIME detection is done) depends on tika-core. Even with the 
standard parsers only, the parse-tika plugin is the biggest plugin in terms of 
dependencies in MiB. So it's not an easy decision to move the 
tika-parsers-standard dependency to core.

> Review MIME type detection
> --
>
> Key: NUTCH-3089
> URL: https://issues.apache.org/jira/browse/NUTCH-3089
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol, util
>Affects Versions: 1.20
>    Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> The MIME detection in {{o.a.n.util.MimeUtil#autoResolveContentType}} needs a 
> review:
> - the fall-back to use the Content-Type HTTP header, only moderately cleaned, 
> leads to strange-looking and obviously misspelled resp. invalid MIME types: 
> "application/.octet-stream", "application/."
>   - note: this issue stems from a [discussion on the Common Crawl user 
> group|https://groups.google.com/g/common-crawl/c/0FANtRcJOts/m/q5KtncIcBgAJ]. 
> More examples are given there.
>   - Tika's method 
> [MimeTypes#forName|https://tika.apache.org/3.0.0/api/org/apache/tika/mime/MimeTypes.html#forName(java.lang.String)]
>  used in [MimeUtil.java, line 
> 162|https://github.com/apache/nutch/blob/e1b8dbe909b0f8c181dcb5ee0e7e072f27f82cbb/src/java/org/apache/nutch/util/MimeUtil.java#L162]
>  does only a limited validation, not sufficient to filter out the above 
> mentioned erroneous MIME types.
> - performance:  if the property {{mime.type.magic}} == true, Tika's magic 
> detector is called with the binary content and the URL (which includes the 
> file suffix) and the Content-Type HTTP header as additional hints to support 
> the detection. Tika's detect method uses similar fall-back heuristics, 
> calling also {{MimeTypes#forName}}. Relying only on Tika's detect method if 
> {{mime.type.magic}} == true, should save computation time, and eventually 
> leads to more precise results.





[jira] [Created] (NUTCH-3089) Review MIME type detection

2024-11-14 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3089:
--

 Summary: Review MIME type detection
 Key: NUTCH-3089
 URL: https://issues.apache.org/jira/browse/NUTCH-3089
 Project: Nutch
  Issue Type: Improvement
  Components: protocol, util
Affects Versions: 1.20
Reporter: Sebastian Nagel
 Fix For: 1.21


The MIME detection in {{o.a.n.util.MimeUtil#autoResolveContentType}} needs a 
review:
- the fall-back to the Content-Type HTTP header, only moderately cleansed, 
leads to strange-looking, obviously misspelled or invalid MIME types: 
"application/.octet-stream", "application/."
  - note: this issue stems from a [discussion on the Common Crawl user 
group|https://groups.google.com/g/common-crawl/c/0FANtRcJOts/m/q5KtncIcBgAJ]. 
More examples are given there.
  - Tika's method 
[MimeTypes#forName|https://tika.apache.org/3.0.0/api/org/apache/tika/mime/MimeTypes.html#forName(java.lang.String)]
 used in [MimeUtil.java, line 
162|https://github.com/apache/nutch/blob/e1b8dbe909b0f8c181dcb5ee0e7e072f27f82cbb/src/java/org/apache/nutch/util/MimeUtil.java#L162]
 performs only limited validation, not sufficient to filter out the erroneous 
MIME types mentioned above.
- performance: if the property {{mime.type.magic}} == true, Tika's magic 
detector is called with the binary content, plus the URL (which includes the 
file suffix) and the Content-Type HTTP header as additional hints to support 
the detection. Tika's detect method uses similar fall-back heuristics, also 
calling {{MimeTypes#forName}}. Relying only on Tika's detect method when 
{{mime.type.magic}} == true should save computation time and may even lead 
to more precise results.
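
A stricter cleansing of the header fall-back could look like the sketch below. 
The class name and regex are illustrative only (roughly approximating the RFC 
6838 type/subtype grammar), not the actual MimeUtil code:

```java
class ContentTypeCleanser {

  // Strip parameters, lowercase, and accept only a plausible type/subtype
  // pair. The first character of type and subtype must be alphanumeric, so
  // "application/.octet-stream" and "application/." are rejected.
  static String cleanse(String header) {
    if (header == null) {
      return null;
    }
    String type = header.split(";", 2)[0].trim().toLowerCase();
    return type.matches("[a-z0-9][a-z0-9!#$&^_.+-]*/[a-z0-9][a-z0-9!#$&^_.+-]*")
        ? type : null;
  }

  public static void main(String[] args) {
    System.out.println(cleanse("text/html; charset=utf-8"));   // prints text/html
    System.out.println(cleanse("application/.octet-stream"));  // prints null
  }
}
```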





[jira] [Reopened] (NUTCH-2599) charset detection issue with parse-tika

2024-11-14 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reopened NUTCH-2599:


Sorry, [~gbouchar], I had missed that only the Tika parser is affected; using 
the plugin parse-html worked.

> charset detection issue with parse-tika
> ---
>
> Key: NUTCH-2599
> URL: https://issues.apache.org/jira/browse/NUTCH-2599
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
> Environment: {code:java}
> plugin.includes: protocol-http|parse-tika{code}
>Reporter: Gerard Bouchar
>Priority: Major
>
> Here is an example page that is displayed correctly in web browsers, but is 
> decoded with the wrong charset in nutch : 
> [https://gerardbouchar.github.io/html-encoding-example/index.html]
>  
> This page's contents are encoded in UTF-8, it is served with HTTP headers 
> indicating that it is in UTF-8, but it contains a bogus HTML meta tag 
> indicating that is encoded in ISO-8859-1.
>  
> This is a tricky case, but there is a [W3C specification about how to handle 
> it|https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding].
>  It clearly states that the HTTP header (transport layer information) should 
> have precedence over the HTML meta tag (obtained in [byte stream 
> prescanning|https://html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding]).
>  Browsers do respect the spec, but the tika parser doesn't.
>  
> Looking at [the source 
> code|https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java],
>  it looks like the charset information is not even extracted from the HTTP 
> headers.
>  
> {code:java}
> HTTP/1.1 200 OK
> Content-Type: text/html; charset=utf-8
> 
> 
>   
>     
>   
>   
>     français
>   
> 
> {code}
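
The precedence rule from the quoted spec - transport-layer charset wins over 
the in-document meta tag - can be sketched as follows. Names such as 
EncodingSniffer are hypothetical, the default fallback is a simplification, 
and the actual fix belongs in TikaParser, which can read the protocol headers 
from the Content object:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EncodingSniffer {

  private static final Pattern CHARSET =
      Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

  /** HTTP header first, then the meta tag, then a simplified default. */
  static Charset resolve(String contentTypeHeader, String metaContent) {
    return fromString(contentTypeHeader)
        .or(() -> fromString(metaContent))
        .orElse(StandardCharsets.UTF_8); // the spec's real fallback is locale-dependent
  }

  private static Optional<Charset> fromString(String s) {
    if (s == null) {
      return Optional.empty();
    }
    Matcher m = CHARSET.matcher(s);
    if (!m.find()) {
      return Optional.empty();
    }
    try {
      return Optional.of(Charset.forName(m.group(1)));
    } catch (IllegalArgumentException e) {
      return Optional.empty(); // unknown or malformed charset name
    }
  }

  public static void main(String[] args) {
    // The example page: the header says utf-8, the meta tag claims ISO-8859-1.
    System.out.println(resolve("text/html; charset=utf-8",
                               "text/html; charset=ISO-8859-1")); // prints UTF-8
  }
}
```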





[jira] [Updated] (NUTCH-2599) charset detection issue with parse-tika

2024-11-14 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2599:
---
Fix Version/s: 1.21

> charset detection issue with parse-tika
> ---
>
> Key: NUTCH-2599
> URL: https://issues.apache.org/jira/browse/NUTCH-2599
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
> Environment: {code:java}
> plugin.includes: protocol-http|parse-tika{code}
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.21
>
>
> Here is an example page that is displayed correctly in web browsers, but is 
> decoded with the wrong charset in nutch : 
> [https://gerardbouchar.github.io/html-encoding-example/index.html]
>  
> This page's contents are encoded in UTF-8, it is served with HTTP headers 
> indicating that it is in UTF-8, but it contains a bogus HTML meta tag 
> indicating that is encoded in ISO-8859-1.
>  
> This is a tricky case, but there is a [W3C specification about how to handle 
> it|https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding].
>  It clearly states that the HTTP header (transport layer information) should 
> have precedence over the HTML meta tag (obtained in [byte stream 
> prescanning|https://html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding]).
>  Browsers do respect the spec, but the tika parser doesn't.
>  
> Looking at [the source 
> code|https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java],
>  it looks like the charset information is not even extracted from the HTTP 
> headers.
>  
> {code:java}
> HTTP/1.1 200 OK
> Content-Type: text/html; charset=utf-8
> 
> 
>   
>     
>   
>   
>     français
>   
> 
> {code}





[jira] [Assigned] (NUTCH-2599) charset detection issue with parse-tika

2024-11-14 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2599:
--

Assignee: Sebastian Nagel

> charset detection issue with parse-tika
> ---
>
> Key: NUTCH-2599
> URL: https://issues.apache.org/jira/browse/NUTCH-2599
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
> Environment: {code:java}
> plugin.includes: protocol-http|parse-tika{code}
>Reporter: Gerard Bouchar
>    Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> Here is an example page that is displayed correctly in web browsers, but is 
> decoded with the wrong charset in nutch : 
> [https://gerardbouchar.github.io/html-encoding-example/index.html]
>  
> This page's contents are encoded in UTF-8, it is served with HTTP headers 
> indicating that it is in UTF-8, but it contains a bogus HTML meta tag 
> indicating that is encoded in ISO-8859-1.
>  
> This is a tricky case, but there is a [W3C specification about how to handle 
> it|https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding].
>  It clearly states that the HTTP header (transport layer information) should 
> have precedence over the HTML meta tag (obtained in [byte stream 
> prescanning|https://html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding]).
>  Browsers do respect the spec, but the tika parser doesn't.
>  
> Looking at [the source 
> code|https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java],
>  it looks like the charset information is not even extracted from the HTTP 
> headers.
>  
> {code:java}
> HTTP/1.1 200 OK
> Content-Type: text/html; charset=utf-8
> 
> 
>   
>     
>   
>   
>     français
>   
> 
> {code}





[jira] [Resolved] (NUTCH-2599) charset detection issue with parse-tika

2024-11-14 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2599.

Fix Version/s: (was: 1.21)
   Resolution: Fixed

Thanks, [~gbouchar]. This old issue is now fixed:

{noformat}
$> bin/nutch parsechecker 
https://gerardbouchar.github.io/html-encoding-example/index.html

...
Outlinks: 1 
 outlink: toUrl: https://gerardbouchar.github.io/ anchor: français
...
Parse Metadata:

  CharEncodingForConversion = utf-8

  OriginalCharEncoding = utf-8{noformat}

Tika's detector (Tika 3.0.0) also correctly identifies UTF-8.

> charset detection issue with parse-tika
> ---
>
> Key: NUTCH-2599
> URL: https://issues.apache.org/jira/browse/NUTCH-2599
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
> Environment: {code:java}
> plugin.includes: protocol-http|parse-tika{code}
>Reporter: Gerard Bouchar
>Priority: Major
>
> Here is an example page that is displayed correctly in web browsers, but is 
> decoded with the wrong charset in nutch : 
> [https://gerardbouchar.github.io/html-encoding-example/index.html]
>  
> This page's contents are encoded in UTF-8, it is served with HTTP headers 
> indicating that it is in UTF-8, but it contains a bogus HTML meta tag 
> indicating that is encoded in ISO-8859-1.
>  
> This is a tricky case, but there is a [W3C specification about how to handle 
> it|https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding].
>  It clearly states that the HTTP header (transport layer information) should 
> have precedence over the HTML meta tag (obtained in [byte stream 
> prescanning|https://html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding]).
>  Browsers do respect the spec, but the tika parser doesn't.
>  
> Looking at [the source 
> code|https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java],
>  it looks like the charset information is not even extracted from the HTTP 
> headers.
>  
> {code:java}
> HTTP/1.1 200 OK
> Content-Type: text/html; charset=utf-8
> 
> 
>   
>     
>   
>   
>     français
>   
> 
> {code}





Re: [DISCUSS] Release Nutch 1.21

2024-11-09 Thread Sebastian Nagel

Hi Lewis,

sorry, I missed this mail among the many Jira and GitHub notifications.

Yes, it's not too early for the next release...

> The following PR’s are currently open

Of course, we should get as many of them merged as possible. I'll also work on
fixing some minor bugs.

> Hadoop 3.4.0 was released

Now, even 3.4.1 is available.

Best,
Sebastian

On 10/27/24 01:31, lewis john mcgibbney wrote:

Hi dev@,

In the 1.21 development drive, so far we've fixed and resolved 20 issues, which 
is great.
The following PRs are currently open and (with a little more work) could 
likely be included in the release, namely
* https://github.com/apache/nutch/pull/832 
* https://github.com/apache/nutch/pull/830 
* https://github.com/apache/nutch/pull/826 
* https://github.com/apache/nutch/pull/825 


Hadoop 3.4.0 was released on March 17th, 2024 so we could upgrade this 
dependency as well.


It would be worth going through and checking Nutch dependencies as well and 
outlining upgrade candidates.


Anything else?

lewismc






[jira] [Commented] (NUTCH-3087) Nutch crawling inconsistent on URLs with userinfo

2024-10-31 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17894476#comment-17894476
 ] 

Sebastian Nagel commented on NUTCH-3087:


Which of the URL normalizers are active? For example, urlnormalizer-basic 
removes the userinfo part for https, http and ftp URLs. There might be a bug 
that removes it for other schemes as well, in cases where other parts of the 
URL are normalized. This looks like a pattern: {{.../architektur.dia%7E}} -> 
{{.../architektur.dia~}}
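
For illustration, a normalizer can unescape characters like %7E while still 
preserving the userinfo part if it rebuilds the URL from all of its components. 
This is only a sketch using java.net.URI (the class name is hypothetical; the 
actual normalizers use their own regex-based rewriting):

```java
import java.net.URI;
import java.net.URISyntaxException;

class UserinfoSafeNormalizer {

  // Rebuild the URL from its parsed components. getPath() decodes "%7E"
  // to "~" (an unreserved character, so it stays literal on re-encoding),
  // while scheme, userinfo, host and port are all carried over.
  static String normalize(String url) {
    try {
      URI u = new URI(url);
      return new URI(u.getScheme(), u.getUserInfo(), u.getHost(), u.getPort(),
          u.getPath(), u.getQuery(), u.getFragment()).toString();
    } catch (URISyntaxException e) {
      return url; // leave unparsable URLs untouched
    }
  }

  public static void main(String[] args) {
    System.out.println(normalize(
        "smb://hiran@nas.fritz.box/Documents/architektur.dia%7E"));
  }
}
```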

> Nutch crawling inconsistent on URLs with userinfo
> -
>
> Key: NUTCH-3087
> URL: https://issues.apache.org/jira/browse/NUTCH-3087
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.21
>Reporter: Hiran Chaudhuri
>Priority: Major
>
> I am trying to scan the URL
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
> Note the userinfo 'hiran', which is used for authentication on the server. 
> (The smb plugin pulls credentials from another configuration file, but this 
> is irrelevant here).
> The URL is fetched, parsed, updated in the crawldb and sent to the indexer. 
> So far so good. But the outlinks that are detected are of different quality: 
> some have the userinfo preserved, some are missing that information.
> Dumping the segment I can see the below data. Note that some of the outlinks 
> start with smb://hi...@nas.fritz.box, while others start with 
> smb://nas.fritz.box. The impact is that on the next fetch run authentication 
> information is missing and the URLs cannot be fetched further.
>  
> {code:java}
> Recno:: 0
> URL:: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
> CrawlDatum::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Tue Oct 29 22:56:58 CET 2024
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 86400 seconds (1 days)
> Score: 1.0
> Signature: null
> Metadata: 
>      _ngt_=1730239026566
> Content::
> Version: -1
> url: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
> base: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/.
> contentType: text/html
> metadata: nutch.segment.name=20241029225708 _fst_=33 nutch.crawl.score=1.0
> Content:
> Index of 
> /Documents/Hiran/Monitoring/Index of 
> /Documents/Hiran/Monitoring/.svn/    Tue Oct 24 
> 13:32:32 CEST 2017
> architektur.dia    Mon Feb 22 21:30:33 CET 2010
> architektur.dia~    Mon Feb 22 21:20:42 CET 
> 2010
> architektur.png    Mon Feb 22 21:34:27 CET 2010
> deployment.dia    Mon Feb 22 22:56:15 CET 2010
> deployment.dia~    Mon Feb 22 22:51:21 CET 
> 2010
> deployment.png    Mon Feb 22 23:00:34 CET 2010
> Monitoring strategy.odt    Fri Aug 01 
> 13:38:04 CEST 2014
> 
> ParseData::
> Version: 5
> Status: success(1,0)
> Title: Index of /Documents/Hiran/Monitoring/
> Outlinks: 5
>   outlink: toUrl: 
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/architektur.dia anchor: 
> architektur.dia Mon Feb 22 21:30:33 CET 2010
>   outlink: toUrl: 
> smb://nas.fritz.box/Documents/Hiran/Monitoring/architektur.dia~ anchor: 
> architektur.dia~ Mon Feb 22 21:20:42 CET 2010
>   outlink: toUrl: 
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/deployment.dia anchor: 
> deployment.dia Mon Feb 22 22:56:15 CET 2010
>   outlink: toUrl: 
> smb://nas.fritz.box/Documents/Hiran/Monitoring/deployment.dia~ anchor: 
> deployment.dia~ Mon Feb 22 22:51:21 CET 2010
>   outlink: toUrl: 
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/Monitoring+strategy.odt 
> anchor: Monitoring strategy.odt Fri Aug 01 13:38:04 CEST 2014
> Content Metadata:
>   nutch.segment.name = 20241029225708
>   nutch.content.digest = a794c6675cb2f9e460e7771060ed2dfc
>   _fst_ = 33
>   nutch.crawl.score = 1.0
> Parse Metadata:
>   CharEncodingForConversion = windows-1252
>   OriginalCharEncoding = windows-1252
>   language = en
> CrawlDatum::
> Version: 7
> Status: 65 (signature)
> Fetch time: Tue Oct 29 22:57:25 CET 2024
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 0 seconds (0 days)
> Score: 0.0
> Signature: a794c6675cb2f9e460e7771060ed2dfc
> Metadata: 
>  
> CrawlDatum::
> Version: 7
> Status: 33 (fetch_success)
> Fetch time: Tue Oct 29 22:57:17 CET 2024
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 86400 seconds (1 days)
> Score: 1.0
> Signature: null
> Metadata: 
>      _ngt_=1730239026566
>     _pst_=success(1), lastModified=0
>     Content-Type=text/

[jira] [Resolved] (NUTCH-2771) Tests in nightly builds: speed up long runners

2024-10-29 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2771.

Resolution: Fixed

> Tests in nightly builds: speed up long runners
> --
>
> Key: NUTCH-2771
> URL: https://issues.apache.org/jira/browse/NUTCH-2771
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, test
>Affects Versions: 1.16
>    Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.21
>
>
> The Nutch tests (run by "ant test" or "ant nightly") take rather long. 
> Although all tests are implemented as JUnit tests, some are more like 
> integration tests, e.g. launching a Jetty web server and fetching documents 
> from it. It's nice to also have higher-level tests, and they are expected to 
> run longer than a simple unit test. However, some of the test classes take 
> really long to run (times taken from 
> https://builds.apache.org/job/Nutch-trunk/3663/consoleText):
> {noformat}
> [junit] Running org.apache.nutch.segment.TestSegmentMergerCrawlDatums
> [junit] Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
> 133.898 sec
> [junit] Running org.apache.nutch.segment.TestSegmentMerger
> [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
> 101.026 sec
> [junit] Running org.apache.nutch.crawl.TestGenerator
> [junit] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
> 46.03 sec
> [junit] Running org.apache.nutch.fetcher.TestFetcher
> [junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
> 17.805 sec
> [junit] Running org.apache.nutch.urlfilter.fast.TestFastURLFilter
> [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
> 12.36 sec
> [junit] Running org.apache.nutch.parse.tika.TestPdfParser
> [junit] Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
> 11.974 sec
> [junit] Running org.apache.nutch.parse.tika.TestImageMetadata
> [junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
> 9.113 sec
> [junit] Running org.apache.nutch.parse.feed.TestFeedParser
> [junit] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
> 6.369 sec
> [junit] Running org.apache.nutch.crawl.TestInjector
> [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
> 6.15 sec
> {noformat}
> We could try to speed up at least some of these long-running tests.





[jira] [Created] (NUTCH-3086) Consolidate plugin extension names and IDs

2024-10-27 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3086:
--

 Summary: Consolidate plugin extension names and IDs
 Key: NUTCH-3086
 URL: https://issues.apache.org/jira/browse/NUTCH-3086
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Affects Versions: 1.21
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.21


The "name" and "id" attributes of the "extension" element in some plugin.xml 
files need consolidation because the same name or ID is used in multiple 
plugins.

Note: because name and ID are optional - "implied" in the plugin.dtd - this has 
no functional consequences; it's just a cosmetic issue.
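
For illustration, a consolidated extension element would carry a fully 
qualified, plugin-specific id and a descriptive name; the plugin and class 
names below are hypothetical:

```xml
<extension id="org.apache.nutch.parse.example"
           name="Example Parser"
           point="org.apache.nutch.parse.Parser">
  <implementation id="org.apache.nutch.parse.example.ExampleParser"
                  class="org.apache.nutch.parse.example.ExampleParser"/>
</extension>
```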





[jira] [Updated] (NUTCH-3083) Add RobotRulesParser to bin/nutch

2024-10-27 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3083:
---
Description: The main method of the class 
{{org.apache.nutch.protocol.RobotRulesParser}} is quite useful if it's about 
verifying whether and how robots.txt files are parsed. It should be added to 
bin/nutch as *robotsparser*, similar to "parsechecker", "filterchecker", etc.  
(was: The main method of the class 
{{org.apache.nutch.protocol.RobotRulesParser}} is quite useful if it's about 
verifying whether and how robots.txt files are parsed. It should be added to 
bin/nutch as *robotschecker*, similar to "parsechecker", "filterchecker", etc.)

> Add RobotRulesParser to bin/nutch
> -
>
> Key: NUTCH-3083
> URL: https://issues.apache.org/jira/browse/NUTCH-3083
> Project: Nutch
>  Issue Type: Improvement
>      Components: bin
>Affects Versions: 1.21
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.21
>
>
> The main method of the class {{org.apache.nutch.protocol.RobotRulesParser}} 
> is quite useful if it's about verifying whether and how robots.txt files are 
> parsed. It should be added to bin/nutch as *robotsparser*, similar to 
> "parsechecker", "filterchecker", etc.





[jira] [Resolved] (NUTCH-3067) Improve performance of FetchItemQueues if error state is preserved

2024-10-24 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3067.

Resolution: Fixed

> Improve performance of FetchItemQueues if error state is preserved
> --
>
> Key: NUTCH-3067
> URL: https://issues.apache.org/jira/browse/NUTCH-3067
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.20
>    Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
> Attachments: 
> Screenshot_20240905_101623_fetcher_tasks_many_queues.png, 
> fetcher.map.2024073925.925750.flamegraph.html
>
>
> In certain cases the error state of a fetch queue needs to be
> preserved, even if the queue is (currently) empty, because there might
> still be URLs in the fetcher input not yet read by the QueueFeeder,
> see NUTCH-2947. To keep the queue together with its state is necessary
> - to skip queues together with all items queued now or to be queued
>   later by the QueueFeeder, if a queue exceeds the maximum configured
>   number of exceptions (NUTCH-769). This is mostly a performance feature,
>   but with implications for politeness because also HTTP 403 Forbidden
>   (and similar) are counted as "exceptions".
> - to implement an exponential backoff which slows down the fetching from sites
>   responding with repeated "exceptions" (NUTCH-2946).
> However, there is a drawback when all "stateful" queues are preserved
> until the QueueFeeder has finished reading input fetch lists: Nutch's
> fetch queue implementation becomes slow if there are too many queues.
> This situation / issue was observed in the first cycle of a crawl
> where only the homepages of millions of sites were fetched:
> - about 1 million homepages per fetcher task
> - about 25% of the homepage URLs caused exceptions - the fetch list was not 
> filtered beforehand to check whether a site is reachable and responding
> - consequently, after a certain amount of time (3-4 hours) 250k queues per 
> task were "stateful" and preserved until the fetch list input was entirely 
> read by the QueueFeeder
> - with too many queues and most of them empty (no URLs) the operations on the 
> queues become slow and fetching almost stalls (see screenshot)
>   - many queues but few URLs queued (250k vs. 25)
>   - most fetcher threads (190 out of 240) waiting for the lock on one of the 
> synchronized methods of FetchItemQueues
>   - also the QueueFeeder is affected by the lock which explains why only few 
> URLs are queued
> Important notes: this is not an issue
> - if no error state is preserved, that is if 
> {{fetcher.max.exceptions.per.queue == -1}} and 
> {{fetcher.exceptions.per.queue.delay == 0.0}}
> - or if the crawl isn't too "broad" in terms of the number of different hosts 
> (domains or IPs, depending on {{fetcher.queue.mode}})
> Possible solutions:
> 1. do not keep every stateful queue: drop queues which have a low exception 
> count after a configurable amount of time. If a second URL from the same 
> host/domain/IP is fetched after a considerably long time span (eg. 30 
> minutes), the effect on performance and politeness should be negligible.
> 2. review the implementation of FetchItemQueues and the locking (synchronized 
> methods)
> 3. at least, try to prioritize QueueFeeder, for example by a method which 
> adds multiple fetch items within one synchronized call
> Details and data:
> Screenshot of the Fetcher map task status in the Hadoop YARN Web UI (attached)
> Counts of the top (deepest) line in the stack traces of all Fetcher threads:
> {noformat}
> 120 at 
> org.apache.nutch.fetcher.FetchItemQueues.getFetchItem(FetchItemQueues.java:177)
> 49  at 
> org.apache.nutch.fetcher.FetchItemQueues.checkExceptionThreshold(FetchItemQueues.java:281)
> 21  at 
> org.apache.nutch.fetcher.FetchItemQueues.getFetchItemQueue(FetchItemQueues.java:166)
> 19  at 
> java.net.PlainSocketImpl.socketConnect(java.base@11.0.24/Native Method)
> 18  at 
> java.net.SocketInputStream.socketRead0(java.base@11.0.24/Native Method)
> 6   at java.lang.Object.wait(java.base@11.0.24/Native Method)  # 
> waiting for HTTP/2 stream
> 4   at java.lang.Thread.sleep(java.base@11.0.24/Native Method)
> 2   at 
> java.net.Inet4AddressImpl.lookupAllHostAddr(java.base@11.0.24/Native Method)
> 1   at 
> java.util
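
The batch-add idea from option 3 in the quoted description can be sketched as 
follows. The class and method names are illustrative, not the actual 
FetchItemQueues API; the point is that the feeder takes the shared lock once 
per batch instead of once per URL:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

class BatchingQueues {
  private final Map<String, Queue<String>> queues = new HashMap<>();

  /** One synchronized call per URL: the feeder competes for the lock often. */
  synchronized void add(String queueId, String url) {
    queues.computeIfAbsent(queueId, k -> new ArrayDeque<>()).add(url);
  }

  /** One synchronized call per batch: far fewer lock acquisitions. */
  synchronized void addAll(String queueId, List<String> urls) {
    queues.computeIfAbsent(queueId, k -> new ArrayDeque<>()).addAll(urls);
  }

  synchronized int size(String queueId) {
    Queue<String> q = queues.get(queueId);
    return q == null ? 0 : q.size();
  }

  public static void main(String[] args) {
    BatchingQueues queues = new BatchingQueues();
    queues.addAll("nas.fritz.box", List.of("smb://a", "smb://b", "smb://c"));
    System.out.println(queues.size("nas.fritz.box")); // prints 3
  }
}
```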

[jira] [Resolved] (NUTCH-3078) Database is not unlocked when injector fails

2024-10-23 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3078.

  Assignee: Hiran Chaudhuri
Resolution: Fixed

PR merged. Thanks, [~hiranchaudhuri]!

> Database is not unlocked when injector fails
> 
>
> Key: NUTCH-3078
> URL: https://issues.apache.org/jira/browse/NUTCH-3078
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.21
> Environment: Ubuntu 22 LTS
> $JAVA_HOME/bin/java -version
> openjdk version "21.0.4" 2024-07-16 LTS
> OpenJDK Runtime Environment Temurin-21.0.4+7 (build 21.0.4+7-LTS)
> OpenJDK 64-Bit Server VM Temurin-21.0.4+7 (build 21.0.4+7-LTS, mixed mode, 
> sharing)
>Reporter: Hiran Chaudhuri
>Assignee: Hiran Chaudhuri
>Priority: Major
> Fix For: 1.21
>
>
> The injector locks the database, but in case of failure it does not unlock 
> it. This is a problem on the next invocation. To reproduce this, start off 
> with a non-existing crawldb and a non-existing seed directory:
> {{./local/bin/nutch inject crawl/crawldb urls}}
> The crawldb is created and locked, but then the injector fails with
> {{2024-10-14 07:43:20,091 ERROR org.apache.nutch.crawl.Injector [main] 
> Injector: java.io.FileNotFoundException: File urls does not exist}}
> {{    at 
> org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:733)}}
> {{    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2078)}}
> {{    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2122)}}
> {{    at 
> org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:970)}}
> {{    at org.apache.nutch.crawl.Injector.inject(Injector.java:418)}}
> {{    at org.apache.nutch.crawl.Injector.run(Injector.java:574)}}
> {{    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
> {{    at org.apache.nutch.crawl.Injector.main(Injector.java:538)}}
> Well, the urls directory indeed does not exist. So let's run the same job 
> with the correct directory:
> {{./local/bin/nutch inject crawl/crawldb ../urls}}
> And even though we have the right directory, the Injector fails with
> {{2024-10-14 07:43:30,147 ERROR org.apache.nutch.crawl.Injector [main] 
> Injector: java.io.IOException: lock file crawl/crawldb/.locked already 
> exists.}}
> {{    at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:50)}}
> {{    at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:80)}}
> {{    at org.apache.nutch.crawl.CrawlDb.lock(CrawlDb.java:193)}}
> {{    at org.apache.nutch.crawl.Injector.inject(Injector.java:404)}}
> {{    at org.apache.nutch.crawl.Injector.run(Injector.java:574)}}
> {{    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
> {{    at org.apache.nutch.crawl.Injector.main(Injector.java:538)}}
> I'd expect that when the Injector finishes (successfully or not), the lock 
> on the DB is removed again.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-3075) tld plugin makes injector crash

2024-10-23 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3075.

  Assignee: Hiran Chaudhuri
Resolution: Fixed

Fixed. Thanks, [~hiranchaudhuri]!

> tld plugin makes injector crash
> ---
>
> Key: NUTCH-3075
> URL: https://issues.apache.org/jira/browse/NUTCH-3075
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.21
> Environment: * Ubuntu 22 LTS
>  * openjdk version "21.0.4" 2024-07-16 LTS
>Reporter: Hiran Chaudhuri
>Assignee: Hiran Chaudhuri
>Priority: Major
> Fix For: 1.21
>
>
> I cloned the current master branch (commit id 
> d6f55b8ea6f5809cef5a31239e5760be23742c00), which compiles nicely to 
> apache-nutch-1.21-SNAPSHOT.job, even after I added my own protocol-imap 
> implementation. Crawling works to some degree - I am heavily experimenting 
> with IMAP and the data I receive in 
> Solr. Looking at the 
> [IndexStructure|https://cwiki.apache.org/confluence/display/NUTCH/IndexStructure]
>  I hoped to get better information by adding all the mentioned plugins.
> Thus I reconfigured nutch-site.xml, especially the `plugin.includes` property 
> to include them all. As soon as `tld` is included, the injector dies upon 
> seeding my CrawlDb with
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is 
> deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is 
> deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is 
> deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is 
> deprecated and will be removed in a future release}}
> {{2024-10-11 23:27:51,295 INFO org.apache.nutch.plugin.PluginManifestParser 
> [main] Plugins: looking in: 
> /home/hiran/NetBeansProjects/nutch/runtime/local/plugins}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main] Plugin Auto-activation mode: [true]}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main] Registered Plugins:}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     the nutch core extension points (nutch-extensionpoints)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     Top Level Domain Plugin (tld)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     IMAP Protocol Plug-in (protocol-imap)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main] Registered Extension-Points:}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Content Parser)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch URL Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (HTML Parse Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Scoring)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch URL Normalizer)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Publisher)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Exchange)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Protocol)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch URL Ignore Exemption Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Index Writer)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Segment Merge Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Indexing Filter)}}
> {{2024-10-11 23:27:51,368 INFO org.apache.nutch.crawl.Injector [main] 
> Injector: starting}}
> {{2024-10-11 23:27:51,368 INFO org.apache.nutch.crawl.Injector [main] 
> Injector: crawlDb: crawl/crawldb}}
> {{2024-10-11 23:27:51,369 INFO org.apache.nutch.crawl.Injector [main] 
> Injector: urlDir: urls}}
> {{2024-10-11 23:27:51,369 INFO org.apache.nutch.crawl.Injector [main] 
> Injector: Converting injected urls to crawl db entries.}}
> {{2

[jira] [Created] (NUTCH-3083) Add RobotRulesParser to bin/nutch

2024-10-23 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3083:
--

 Summary: Add RobotRulesParser to bin/nutch
 Key: NUTCH-3083
 URL: https://issues.apache.org/jira/browse/NUTCH-3083
 Project: Nutch
  Issue Type: Improvement
  Components: bin
Affects Versions: 1.21
Reporter: Sebastian Nagel
 Fix For: 1.21


The main method of the class {{org.apache.nutch.protocol.RobotRulesParser}} is 
quite useful for verifying whether and how robots.txt files are parsed. It 
should be added to bin/nutch as *robotschecker*, similar to "parsechecker", 
"filterchecker", etc.





[jira] [Updated] (NUTCH-3082) Improve logging in case of Nutch job failure

2024-10-23 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3082:
---
Fix Version/s: 1.21

> Improve logging in case of Nutch job failure
> 
>
> Key: NUTCH-3082
> URL: https://issues.apache.org/jira/browse/NUTCH-3082
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.21
>Reporter: Hiran Chaudhuri
>Priority: Major
> Fix For: 1.21
>
>
> Please improve logging in case a Nutch job fails.
> In NUTCH-3075 we found a situation where the injector job dies but the error 
> message is just a 'failed, reason: not available'. This does not help users 
> at all.
> So the reason for the job failure needs to be printed. Or, if the concrete 
> reason is not available, a hint needs to be printed about where to look for 
> the real reason. This is especially true for distributed job execution.
> One special case that may have been overlooked previously:
> In NUTCH-3075 the job failed during setup, not at runtime. So with this 
> ticket, make sure that either a reason or a hint pointing to the reason is 
> printed for
>  * local execution failing during setup
>  * local execution failing during execution
>  * distributed execution failing during setup
>  * distributed execution failing during execution
> These tests need to be applied to the Injector, but potentially to all Nutch 
> Hadoop jobs.
>  
> I opened this issue based on
> https://issues.apache.org/jira/browse/NUTCH-3075?focusedCommentId=17891812&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17891812





[jira] [Commented] (NUTCH-3075) tld plugin makes injector crash

2024-10-22 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891812#comment-17891812
 ] 

Sebastian Nagel commented on NUTCH-3075:


Hi [~hiranchaudhuri], would you mind opening a new issue for improvements of 
the logging in case jobs fail? I think it's better to keep this issue focused 
on the core bug, the broken plugin.xml of the "tld" plugin.


> tld plugin makes injector crash
> ---
>
> Key: NUTCH-3075
> URL: https://issues.apache.org/jira/browse/NUTCH-3075
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.21
> Environment: * Ubuntu 22 LTS
>  * openjdk version "21.0.4" 2024-07-16 LTS
>Reporter: Hiran Chaudhuri
>Priority: Major
> Fix For: 1.21
>
>
> I cloned the current master branch (commit id 
> d6f55b8ea6f5809cef5a31239e5760be23742c00), which compiles nicely to 
> apache-nutch-1.21-SNAPSHOT.job, even after I added my own protocol-imap 
> implementation. Crawling works to some degree - I am heavily experimenting 
> with IMAP and the data I receive in 
> Solr. Looking at the 
> [IndexStructure|https://cwiki.apache.org/confluence/display/NUTCH/IndexStructure]
>  I hoped to get better information by adding all the mentioned plugins.
> Thus I reconfigured nutch-site.xml, especially the `plugin.includes` property 
> to include them all. As soon as `tld` is included, the injector dies upon 
> seeding my CrawlDb with
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is 
> deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is 
> deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is 
> deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is 
> deprecated and will be removed in a future release}}
> {{2024-10-11 23:27:51,295 INFO org.apache.nutch.plugin.PluginManifestParser 
> [main] Plugins: looking in: 
> /home/hiran/NetBeansProjects/nutch/runtime/local/plugins}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main] Plugin Auto-activation mode: [true]}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main] Registered Plugins:}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     the nutch core extension points (nutch-extensionpoints)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     Top Level Domain Plugin (tld)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     IMAP Protocol Plug-in (protocol-imap)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main] Registered Extension-Points:}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Content Parser)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch URL Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (HTML Parse Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Scoring)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch URL Normalizer)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Publisher)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Exchange)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Protocol)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch URL Ignore Exemption Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Index Writer)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Segment Merge Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Indexing Filter)}}
> {{2024-10-11 23:27:51,368 INFO org.apache.nutch.crawl.Injector [main] 
> Injector: starting}}
> {{2024-10-11 23:27:51,368 INFO org.apache.nutch.crawl.Injector [main] 
> Injector: crawlDb: crawl/crawldb}}
> {{2024-10-11 23:27:51,369 INFO org.apache.nutch.crawl.Injector [main] 
> Injecto

[jira] [Updated] (NUTCH-3075) tld plugin makes injector crash

2024-10-20 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3075:
---
Fix Version/s: 1.21

> tld plugin makes injector crash
> ---
>
> Key: NUTCH-3075
> URL: https://issues.apache.org/jira/browse/NUTCH-3075
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.21
> Environment: * Ubuntu 22 LTS
>  * openjdk version "21.0.4" 2024-07-16 LTS
>Reporter: Hiran Chaudhuri
>Priority: Major
> Fix For: 1.21
>
>
> I cloned the current master branch (commit id 
> d6f55b8ea6f5809cef5a31239e5760be23742c00), which compiles nicely to 
> apache-nutch-1.21-SNAPSHOT.job, even after I added my own protocol-imap 
> implementation. Crawling works to some degree - I am heavily experimenting 
> with IMAP and the data I receive in 
> Solr. Looking at the 
> [IndexStructure|https://cwiki.apache.org/confluence/display/NUTCH/IndexStructure]
>  I hoped to get better information by adding all the mentioned plugins.
> Thus I reconfigured nutch-site.xml, especially the `plugin.includes` property 
> to include them all. As soon as `tld` is included, the injector dies upon 
> seeding my CrawlDb with
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is 
> deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is 
> deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is 
> deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is 
> deprecated and will be removed in a future release}}
> {{2024-10-11 23:27:51,295 INFO org.apache.nutch.plugin.PluginManifestParser 
> [main] Plugins: looking in: 
> /home/hiran/NetBeansProjects/nutch/runtime/local/plugins}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main] Plugin Auto-activation mode: [true]}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main] Registered Plugins:}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     the nutch core extension points (nutch-extensionpoints)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     Top Level Domain Plugin (tld)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     IMAP Protocol Plug-in (protocol-imap)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main] Registered Extension-Points:}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Content Parser)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch URL Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (HTML Parse Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Scoring)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch URL Normalizer)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Publisher)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Exchange)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Protocol)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch URL Ignore Exemption Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Index Writer)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Segment Merge Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Indexing Filter)}}
> {{2024-10-11 23:27:51,368 INFO org.apache.nutch.crawl.Injector [main] 
> Injector: starting}}
> {{2024-10-11 23:27:51,368 INFO org.apache.nutch.crawl.Injector [main] 
> Injector: crawlDb: crawl/crawldb}}
> {{2024-10-11 23:27:51,369 INFO org.apache.nutch.crawl.Injector [main] 
> Injector: urlDir: urls}}
> {{2024-10-11 23:27:51,369 INFO org.apache.nutch.crawl.Injector [main] 
> Injector: Converting injected urls to crawl db entries.}}
> {{2024-10-11 23:27:51,519 INFO org.apache.nutch.crawl.Injector [main] 
> Injecting seed URL file 
> 

[jira] [Commented] (NUTCH-3075) tld plugin makes injector crash

2024-10-20 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891230#comment-17891230
 ] 

Sebastian Nagel commented on NUTCH-3075:


Hi [~hiranchaudhuri], thanks for figuring out the reason for the error!

The error is my fault; it relates to NUTCH-1942. A fix/PR is ready.

I'm not sure about catching exceptions in the setup methods:
- of course, logging them helps to understand the error in local mode more 
quickly
- but there are about 50 job, mapper, and reducer implementations, all 
implementing/overriding the setup method. Do we want to change them all? It's 
more a matter of clearly documenting that, in case of errors, the hadoop.log 
(or the task logs, if running in distributed mode) needs to be consulted.
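For illustration, the discussed wrapping of setup-time initialization could look like the following minimal sketch. The class and method names are invented; real code would override the Hadoop Mapper/Reducer setup(Context) methods and use the job's logger instead of System.err.

```java
// Sketch of catching and logging a setup failure before rethrowing, so
// the reason appears directly in the console output in local mode.
// Names here are illustrative only, not the actual Nutch/Hadoop API.
class SetupErrorLogging {

  // Wraps an initialization step and logs the failure reason before
  // letting the task fail as before.
  static void runSetup(Runnable init) {
    try {
      init.run();
    } catch (RuntimeException e) {
      System.err.println("Job setup failed: " + e.getMessage());
      throw e; // still fail the task, but with a visible reason
    }
  }
}
```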

> tld plugin makes injector crash
> ---
>
> Key: NUTCH-3075
> URL: https://issues.apache.org/jira/browse/NUTCH-3075
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.21
> Environment: * Ubuntu 22 LTS
>  * openjdk version "21.0.4" 2024-07-16 LTS
>Reporter: Hiran Chaudhuri
>Priority: Major
>
> I cloned the current master branch (commit id 
> d6f55b8ea6f5809cef5a31239e5760be23742c00), which compiles nicely to 
> apache-nutch-1.21-SNAPSHOT.job, even after I added my own protocol-imap 
> implementation. Crawling works to some degree - I am heavily experimenting 
> with IMAP and the data I receive in 
> Solr. Looking at the 
> [IndexStructure|https://cwiki.apache.org/confluence/display/NUTCH/IndexStructure]
>  I hoped to get better information by adding all the mentioned plugins.
> Thus I reconfigured nutch-site.xml, especially the `plugin.includes` property 
> to include them all. As soon as `tld` is included, the injector dies upon 
> seeding my CrawlDb with
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is 
> deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is 
> deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is 
> deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is 
> deprecated and will be removed in a future release}}
> {{2024-10-11 23:27:51,295 INFO org.apache.nutch.plugin.PluginManifestParser 
> [main] Plugins: looking in: 
> /home/hiran/NetBeansProjects/nutch/runtime/local/plugins}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main] Plugin Auto-activation mode: [true]}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main] Registered Plugins:}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     the nutch core extension points (nutch-extensionpoints)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     Top Level Domain Plugin (tld)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     IMAP Protocol Plug-in (protocol-imap)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main] Registered Extension-Points:}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Content Parser)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch URL Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (HTML Parse Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Scoring)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch URL Normalizer)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Publisher)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Exchange)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Protocol)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch URL Ignore Exemption Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Index Writer)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Segment Merge Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Index

[jira] [Commented] (NUTCH-3078) Database is not unlocked when injector fails

2024-10-15 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889772#comment-17889772
 ] 

Sebastian Nagel commented on NUTCH-3078:


> LockUtil.createHealthLock();

Sure, it would be more verbose. The downside is that it would need to be 
implemented in all places where the CrawlDb is changed. But I suggest moving 
all discussion about that to NUTCH-3080 and going with the simplest solution 
for now. Thanks, [~hiranchaudhuri]!
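The simplest solution amounts to releasing the lock in a finally block around the job. A hedged sketch using plain java.nio.file stand-ins; the real code goes through LockUtil and CrawlDb.lock/unlock, and the names here are illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: acquire the lock, run the job, and always release the lock,
// so a failed run does not leave crawl/crawldb/.locked behind.
class LockedJob {

  static void runWithLock(Path lockFile, Runnable job) throws IOException {
    Files.createFile(lockFile); // fails if the lock already exists
    try {
      job.run(); // may throw, e.g. on a missing seed directory
    } finally {
      Files.deleteIfExists(lockFile); // unlock on success and on failure
    }
  }
}
```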

> Database is not unlocked when injector fails
> 
>
> Key: NUTCH-3078
> URL: https://issues.apache.org/jira/browse/NUTCH-3078
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.21
> Environment: Ubuntu 22 LTS
> $JAVA_HOME/bin/java -version
> openjdk version "21.0.4" 2024-07-16 LTS
> OpenJDK Runtime Environment Temurin-21.0.4+7 (build 21.0.4+7-LTS)
> OpenJDK 64-Bit Server VM Temurin-21.0.4+7 (build 21.0.4+7-LTS, mixed mode, 
> sharing)
>Reporter: Hiran Chaudhuri
>Priority: Major
> Fix For: 1.21
>
>
> The injector locks the database, but in case of failure it does not unlock 
> it. This is a problem on the next invocation. To reproduce this, start off 
> with a non-existing crawldb and a non-existing seed directory:
> {{./local/bin/nutch inject crawl/crawldb urls}}
> The crawldb is created and locked, but then the injector fails with
> {{2024-10-14 07:43:20,091 ERROR org.apache.nutch.crawl.Injector [main] 
> Injector: java.io.FileNotFoundException: File urls does not exist}}
> {{    at 
> org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:733)}}
> {{    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2078)}}
> {{    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2122)}}
> {{    at 
> org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:970)}}
> {{    at org.apache.nutch.crawl.Injector.inject(Injector.java:418)}}
> {{    at org.apache.nutch.crawl.Injector.run(Injector.java:574)}}
> {{    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
> {{    at org.apache.nutch.crawl.Injector.main(Injector.java:538)}}
> Well, the urls directory indeed does not exist. So let's run the same job 
> with the correct directory:
> {{./local/bin/nutch inject crawl/crawldb ../urls}}
> And even though we have the right directory, the Injector fails with
> {{2024-10-14 07:43:30,147 ERROR org.apache.nutch.crawl.Injector [main] 
> Injector: java.io.IOException: lock file crawl/crawldb/.locked already 
> exists.}}
> {{    at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:50)}}
> {{    at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:80)}}
> {{    at org.apache.nutch.crawl.CrawlDb.lock(CrawlDb.java:193)}}
> {{    at org.apache.nutch.crawl.Injector.inject(Injector.java:404)}}
> {{    at org.apache.nutch.crawl.Injector.run(Injector.java:574)}}
> {{    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
> {{    at org.apache.nutch.crawl.Injector.main(Injector.java:538)}}
> I'd expect that when the Injector finishes (successfully or not), the lock 
> on the DB is removed again.





[jira] [Updated] (NUTCH-3079) Dumping a segment fails unless it has been fetched and parsed

2024-10-15 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3079:
---
Fix Version/s: 1.21

> Dumping a segment fails unless it has been fetched and parsed
> -
>
> Key: NUTCH-3079
> URL: https://issues.apache.org/jira/browse/NUTCH-3079
> Project: Nutch
>  Issue Type: Bug
> Environment: Ubuntu 22 LTS
> $ $JAVA_HOME/bin/java -version
> openjdk version "21.0.4" 2024-07-16 LTS
> OpenJDK Runtime Environment Temurin-21.0.4+7 (build 21.0.4+7-LTS)
> OpenJDK 64-Bit Server VM Temurin-21.0.4+7 (build 21.0.4+7-LTS, mixed mode, 
> sharing)
>Reporter: Hiran Chaudhuri
>Priority: Major
> Fix For: 1.21
>
>
> On some existing crawldb generate a new segment:
> {{./local/bin/nutch generate crawl/crawldb crawl/segments}}
> {{...}}
> {{2024-10-14 07:58:58,589 INFO org.apache.nutch.crawl.Generator [main] 
> Generator: segment: crawl/segments/20241014075858}}
> {{2024-10-14 07:58:59,731 INFO org.apache.nutch.crawl.Generator [main] 
> Generator: finished, elapsed: 3423 ms}}
> Then try to dump this new segment:
> {{./local/bin/nutch readseg -dump crawl/segments/20241014075858 
> crawl/log/dumpsegment}}
> This errors out with
> {{2024-10-14 08:01:10,448 INFO org.apache.nutch.segment.SegmentReader [main] 
> SegmentReader: dump segment: crawl/segments/20241014075858}}
> {{2024-10-14 08:01:10,705 ERROR org.apache.nutch.segment.SegmentReader [main] 
> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
> not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_fetch}}
> {{Input path does not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_parse}}
> {{Input path does not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/content}}
> {{Input path does not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/parse_data}}
> {{Input path does not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/parse_text}}
> {{    at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:340)}}
> {{    at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:279)}}
> {{    at 
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)}}
> {{    at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:404)}}
> {{    at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:310)}}
> {{    at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:327)}}
> {{    at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)}}
> {{    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1678)}}
> {{    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1675)}}
> {{    at 
> java.base/java.security.AccessController.doPrivileged(AccessController.java:714)}}
> {{    at java.base/javax.security.auth.Subject.doAs(Subject.java:525)}}
> {{    at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)}}
> {{    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1675)}}
> {{    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1696)}}
> {{    at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:238)}}
> {{    at org.apache.nutch.segment.SegmentReader.run(SegmentReader.java:677)}}
> {{    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
> {{    at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:765)}}
> {{Caused by: java.io.IOException: Input path does not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_fetch}}
> {{    at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:313)}}
> {{    ... 17 more}}
> {{Exception in thread "main" 
> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
> not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_fetch}}
> {{Input path does not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_parse}}
> {{Input path does not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/cont

[jira] [Commented] (NUTCH-3079) Dumping a segment fails unless it has been fetched and parsed

2024-10-15 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889592#comment-17889592
 ] 

Sebastian Nagel commented on NUTCH-3079:


+1 Yes, indeed! I've run into it many times ... and got used to it :(

Just showing a message that a subdirectory is missing should be sufficient.
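A minimal sketch of such a check (hypothetical code, using java.nio rather than the Hadoop FileSystem API that SegmentReader actually uses): enumerate the standard segment subdirectories, log a message for each missing one, and only hand the existing ones to the dump job.

```java
import java.nio.file.*;
import java.util.*;

// Hypothetical sketch, not the actual SegmentReader code: before submitting
// the dump job, check which of the standard segment subdirectories exist and
// only use those as input paths, printing a note for the missing ones.
public class SegmentDirCheck {
    // The subdirectories a fully fetched and parsed segment contains.
    static final String[] PARTS = {
        "crawl_generate", "crawl_fetch", "crawl_parse",
        "content", "parse_data", "parse_text"
    };

    /** Returns the subdirectories that actually exist under the segment. */
    public static List<String> existingParts(Path segment) {
        List<String> present = new ArrayList<>();
        for (String part : PARTS) {
            if (Files.isDirectory(segment.resolve(part))) {
                present.add(part);
            } else {
                System.err.println("Segment part missing (skipped): " + part);
            }
        }
        return present;
    }
}
```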

> Dumping a segment fails unless it has been fetched and parsed
> -
>
> Key: NUTCH-3079
> URL: https://issues.apache.org/jira/browse/NUTCH-3079
> Project: Nutch
>  Issue Type: Bug
> Environment: Ubuntu 22 LTS
> $ $JAVA_HOME/bin/java -version
> openjdk version "21.0.4" 2024-07-16 LTS
> OpenJDK Runtime Environment Temurin-21.0.4+7 (build 21.0.4+7-LTS)
> OpenJDK 64-Bit Server VM Temurin-21.0.4+7 (build 21.0.4+7-LTS, mixed mode, 
> sharing)
>Reporter: Hiran Chaudhuri
>Priority: Major
>
> On some existing crawldb generate a new segment:
> {{./local/bin/nutch generate crawl/crawldb crawl/segments}}
> {{...}}
> {{2024-10-14 07:58:58,589 INFO org.apache.nutch.crawl.Generator [main] 
> Generator: segment: crawl/segments/20241014075858}}
> {{2024-10-14 07:58:59,731 INFO org.apache.nutch.crawl.Generator [main] 
> Generator: finished, elapsed: 3423 ms}}
> Then try to dump this new segment:
> {{./local/bin/nutch readseg -dump crawl/segments/20241014075858 
> crawl/log/dumpsegment}}
> {{This errors out with}}
> {{2024-10-14 08:01:10,448 INFO org.apache.nutch.segment.SegmentReader [main] 
> SegmentReader: dump segment: crawl/segments/20241014075858}}
> {{2024-10-14 08:01:10,705 ERROR org.apache.nutch.segment.SegmentReader [main] 
> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
> not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_fetch}}
> {{Input path does not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_parse}}
> {{Input path does not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/content}}
> {{Input path does not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/parse_data}}
> {{Input path does not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/parse_text}}
> {{    at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:340)}}
> {{    at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:279)}}
> {{    at 
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)}}
> {{    at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:404)}}
> {{    at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:310)}}
> {{    at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:327)}}
> {{    at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)}}
> {{    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1678)}}
> {{    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1675)}}
> {{    at 
> java.base/java.security.AccessController.doPrivileged(AccessController.java:714)}}
> {{    at java.base/javax.security.auth.Subject.doAs(Subject.java:525)}}
> {{    at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)}}
> {{    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1675)}}
> {{    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1696)}}
> {{    at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:238)}}
> {{    at org.apache.nutch.segment.SegmentReader.run(SegmentReader.java:677)}}
> {{    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
> {{    at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:765)}}
> {{Caused by: java.io.IOException: Input path does not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_fetch}}
> {{    at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:313)}}
> {{{}    ... 17 more{}}}{{{}Exception in thread "main" 
> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
> not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_fetch{}}}
> {{Input path does not exist: 
> file:/home/hiran/NetBeansProjects/nutch/runtime/crawl/segments/20241014075858/crawl_parse}}
> {{Input

[jira] [Created] (NUTCH-3080) Injector and CrawlDbMerger to keep lockfile if CrawlDb install failed

2024-10-15 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3080:
--

 Summary: Injector and CrawlDbMerger to keep lockfile if CrawlDb 
install failed
 Key: NUTCH-3080
 URL: https://issues.apache.org/jira/browse/NUTCH-3080
 Project: Nutch
  Issue Type: Bug
  Components: crawldb, injector
Affects Versions: 1.20
Reporter: Sebastian Nagel
 Fix For: 1.21


(see the discussion in NUTCH-3078)

Injector and CrawlDbMerger should keep the CrawlDb lockfile if the CrawlDb 
installation fails, since a failed installation may leave an incomplete CrawlDb 
or even cause data loss. 
See for comparison 
[CrawlDb.update(...)|https://github.com/apache/nutch/blob/4a61208f492613f2c5282741e64c036acabeb71e/src/java/org/apache/nutch/crawl/CrawlDb.java#L145]
 or 
[DeduplicationJob.run(...)|https://github.com/apache/nutch/blob/4a61208f492613f2c5282741e64c036acabeb71e/src/java/org/apache/nutch/crawl/DeduplicationJob.java].

In addition, there should be a clear message that the lockfile is kept because 
the CrawlDb could be "damaged", which requires manual cleanup or a recovery action.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-3078) Database is not unlocked when injector fails

2024-10-15 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3078:
---
Fix Version/s: 1.21

> Database is not unlocked when injector fails
> 
>
> Key: NUTCH-3078
> URL: https://issues.apache.org/jira/browse/NUTCH-3078
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.21
> Environment: Ubuntu 22 LTS
> $JAVA_HOME/bin/java -version
> openjdk version "21.0.4" 2024-07-16 LTS
> OpenJDK Runtime Environment Temurin-21.0.4+7 (build 21.0.4+7-LTS)
> OpenJDK 64-Bit Server VM Temurin-21.0.4+7 (build 21.0.4+7-LTS, mixed mode, 
> sharing)
>Reporter: Hiran Chaudhuri
>Priority: Major
> Fix For: 1.21
>
>
> The injector locks the database but in case of failure does not unlock it. 
> This is a problem on the next invocation. To repeat this, start off with a 
> non-existing crawldb and non-existing seed directory:
> {{./local/bin/nutch inject crawl/crawldb urls}}
> The crawldb is created and locked, but then the injector fails with
> {{2024-10-14 07:43:20,091 ERROR org.apache.nutch.crawl.Injector [main] 
> Injector: java.io.FileNotFoundException: File urls does not exist}}
> {{    at 
> org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:733)}}
> {{    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2078)}}
> {{    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2122)}}
> {{    at 
> org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:970)}}
> {{    at org.apache.nutch.crawl.Injector.inject(Injector.java:418)}}
> {{    at org.apache.nutch.crawl.Injector.run(Injector.java:574)}}
> {{    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
> {{    at org.apache.nutch.crawl.Injector.main(Injector.java:538)}}
> Well, the urls directory indeed does not exist. So let's run the same job 
> with the correct directory:
> {{./local/bin/nutch inject crawl/crawldb ../urls}}
> And although we have the right directory, the Injector fails with
> {{2024-10-14 07:43:30,147 ERROR org.apache.nutch.crawl.Injector [main] 
> Injector: java.io.IOException: lock file crawl/crawldb/.locked already 
> exists.}}
> {{    at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:50)}}
> {{    at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:80)}}
> {{    at org.apache.nutch.crawl.CrawlDb.lock(CrawlDb.java:193)}}
> {{    at org.apache.nutch.crawl.Injector.inject(Injector.java:404)}}
> {{    at org.apache.nutch.crawl.Injector.run(Injector.java:574)}}
> {{    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
> {{    at org.apache.nutch.crawl.Injector.main(Injector.java:538)}}
> I'd expect that when the Injector finishes (successfully or not), the lock on 
> the DB is removed again.





[jira] [Commented] (NUTCH-3078) Database is not unlocked when injector fails

2024-10-15 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889585#comment-17889585
 ] 

Sebastian Nagel commented on NUTCH-3078:


Hi [~hiranchaudhuri], thanks and good catch!

And thanks for the initiative to simplify the lock handling. It's indeed one of 
the cumbersome areas with many small bugs (such as this one) that make Nutch 
difficult to use.

However, there is one situation where an existing CrawlDb might be damaged, if 
the lock is unconditionally removed:

- the exception happens in {{CrawlDb.install(job, crawldb)}} and
-- the folder {{current/}} was successfully moved to {{old/}}
-- but the new, temporary CrawlDb was not copied to the final location 
({{current/}})
-- or is copied only partially, in case the underlying filesystem does not 
support an atomic directory {{rename()}}. That's usually the case for cloud 
storage abstractions, see [S3A: Directories are 
mimicked|https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Warning_.231:_Directories_are_mimicked]

In this situation, it would be better to keep the lock, so that no CrawlDb 
write operation is allowed to run until a manual cleanup. That allows users to 
analyze what happened and likely save the data. If the lock is removed, data 
may get lost.

So, that's the reason why the lock/unlock and cleanup code is so complex. It 
does a little more than just ensure that only one job reading or writing the 
CrawlDb runs at a time.

It's also the reason why try-catch blocks should be "focused" on errors 
happening when running the job. It shouldn't include the {{CrawlDb.install(job, 
crawldb)}} call. Currently, it does, which is wrong - in Injector but also in 
CrawlDbMerger - but that's a separate issue. See for comparison 
[CrawlDb.update(...)|https://github.com/apache/nutch/blob/4a61208f492613f2c5282741e64c036acabeb71e/src/java/org/apache/nutch/crawl/CrawlDb.java#L145]
 or 
[DeduplicationJob.run(...)|https://github.com/apache/nutch/blob/4a61208f492613f2c5282741e64c036acabeb71e/src/java/org/apache/nutch/crawl/DeduplicationJob.java].

Another way to make the cleanup of the lock file simpler would be to obtain 
the lock later, shortly before running the job...
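The distinction can be sketched roughly like this (hypothetical, heavily simplified code, not the actual Injector/CrawlDb API; a plain file stands in for the Hadoop lockfile, and the boolean flags stand in for real failures): an error while running the job releases the lock, but an error during the install step keeps it so the possibly damaged CrawlDb is left alone until manual cleanup.

```java
import java.nio.file.*;

// Hypothetical sketch of the "focused" try-catch described above. The method
// names and flags are illustrative only.
public class LockDemo {
    public static String update(Path lock, boolean jobFails, boolean installFails)
            throws Exception {
        Files.createFile(lock);          // acquire the CrawlDb lock
        try {
            // run the MapReduce job (simulated)
            if (jobFails) throw new Exception("job failed");
        } catch (Exception e) {
            Files.delete(lock);          // safe to unlock: CrawlDb untouched
            return "job-failed-unlocked";
        }
        try {
            // install the new CrawlDb: current/ -> old/, temp -> current/
            if (installFails) throw new Exception("install failed");
        } catch (Exception e) {
            // keep the lock: current/ may be partially replaced
            return "install-failed-locked";
        }
        Files.delete(lock);              // success: release the lock
        return "ok-unlocked";
    }
}
```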

> Database is not unlocked when injector fails
> 
>
> Key: NUTCH-3078
> URL: https://issues.apache.org/jira/browse/NUTCH-3078
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.21
> Environment: Ubuntu 22 LTS
> $JAVA_HOME/bin/java -version
> openjdk version "21.0.4" 2024-07-16 LTS
> OpenJDK Runtime Environment Temurin-21.0.4+7 (build 21.0.4+7-LTS)
> OpenJDK 64-Bit Server VM Temurin-21.0.4+7 (build 21.0.4+7-LTS, mixed mode, 
> sharing)
>Reporter: Hiran Chaudhuri
>Priority: Major
>
> The injector locks the database but in case of failure does not unlock it. 
> This is a problem on the next invocation. To repeat this, start off with a 
> non-existing crawldb and non-existing seed directory:
> {{./local/bin/nutch inject crawl/crawldb urls}}
> The crawldb is created and locked, but then the injector fails with
> {{2024-10-14 07:43:20,091 ERROR org.apache.nutch.crawl.Injector [main] 
> Injector: java.io.FileNotFoundException: File urls does not exist}}
> {{    at 
> org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:733)}}
> {{    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2078)}}
> {{    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2122)}}
> {{    at 
> org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:970)}}
> {{    at org.apache.nutch.crawl.Injector.inject(Injector.java:418)}}
> {{    at org.apache.nutch.crawl.Injector.run(Injector.java:574)}}
> {{    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
> {{    at org.apache.nutch.crawl.Injector.main(Injector.java:538)}}
> Well, the urls directory indeed does not exist. So let's run the same job 
> with the correct directory:
> {{./local/bin/nutch inject crawl/crawldb ../urls}}
> And although we have the right directory, the Injector fails with
> {{2024-10-14 07:43:30,147 ERROR org.apache.nutch.crawl.Injector [main] 
> Injector: java.io.IOException: lock file crawl/crawldb/.locked already 
> exists.}}
> {{    at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:50)}}
> {{    at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:80)}}
> {{    at org.apache.nutch.crawl.CrawlDb.lock(CrawlDb.java:193)}}
> {{    at org.apache.nutch.crawl.Injector.inject(Injector.java:404)}}
&g

[jira] [Resolved] (NUTCH-3073) Address Java compiler warnings

2024-10-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3073.

Resolution: Fixed

> Address Java compiler warnings
> --
>
> Key: NUTCH-3073
> URL: https://issues.apache.org/jira/browse/NUTCH-3073
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.20
>    Reporter: Sebastian Nagel
>    Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> Depending on the warning level, there are many Java compiler warnings in the 
> Nutch code base. This issue and an upcoming PR are meant to cover their 
> fixup. The following warnings are addressed:
> - unused imports
> - missing type arguments (Collections, etc.)
> - unused variables
> - leaking resources





[jira] [Created] (NUTCH-3073) Address Java compiler warnings

2024-10-04 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3073:
--

 Summary: Address Java compiler warnings
 Key: NUTCH-3073
 URL: https://issues.apache.org/jira/browse/NUTCH-3073
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.20
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.21


Depending on the warning level, there are many Java compiler warnings in the 
Nutch code base. This issue and an upcoming PR are meant to cover their fixup. 
The following warnings are addressed:
- unused imports
- missing type arguments (Collections, etc.)
- unused variables
- leaking resources
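Generic Java illustrations of two of the warning categories listed above (illustrative code, not actual Nutch code): supplying the missing type arguments, and closing a would-be leaking resource with try-with-resources.

```java
import java.io.*;
import java.util.*;

// Illustrative fixes for two common warning categories (not Nutch code).
public class WarningFixes {
    // Fixed "missing type arguments": List<String> instead of a raw List.
    public static List<String> typed() {
        List<String> names = new ArrayList<>();
        names.add("nutch");
        return names;
    }

    // Fixed "leaking resource": the reader is closed automatically,
    // even if readLine() throws.
    public static String firstLine(File f) throws IOException {
        try (BufferedReader r = new BufferedReader(new FileReader(f))) {
            return r.readLine();
        }
    }
}
```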





[jira] [Created] (NUTCH-3072) Fetcher to stop QueueFeeder if aborting with "hung threads"

2024-10-04 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3072:
--

 Summary: Fetcher to stop QueueFeeder if aborting with "hung 
threads"
 Key: NUTCH-3072
 URL: https://issues.apache.org/jira/browse/NUTCH-3072
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.20
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.21


Fetcher shuts down if there is no progress (not a single URL fetched) during 
half of the MapReduce task timeout, see fetcher.thread.timeout.divisor 
and NUTCH-1057. Before the shutdown, Fetcher reports the active FetcherThreads 
as "hung threads" and drops the existing FetchQueues. After that the task continues 
with sorting and merging the spilled data. FetcherThreads and also the 
QueueFeeder might still be running at this moment, which opens up potential 
concurrency issues when a FetcherThread writes output data while the output is 
already being sorted.

Fetcher should stop the QueueFeeder and/or make sure it isn't alive anymore. In 
addition, a short wait (one second) should help FetcherThreads to shut down.
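The proposed shutdown order can be sketched like this (an assumption about the approach, not the actual Fetcher code): signal the feeder thread to stop, then grant a short grace period for it to terminate before the task proceeds to sort and merge the output.

```java
// Hypothetical sketch of the shutdown step: interrupt the feeder so it stops
// reading input, wait briefly, and report whether it actually terminated.
public class FeederShutdown {
    public static boolean stopFeeder(Thread feeder, long graceMillis)
            throws InterruptedException {
        feeder.interrupt();              // ask the feeder to stop reading input
        feeder.join(graceMillis);        // wait up to the grace period
        return !feeder.isAlive();        // true if it really terminated
    }
}
```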

The issue was observed while testing a solution for NUTCH-3067.

{noformat}
2024-10-02 09:33:18,796 INFO [main] fetcher.Fetcher: -activeThreads=120, 
spinWaiting=119, fetchQueues.totalSize=12000, fetchQueues.getQueueCount=9884
2024-10-02 09:33:18,797 WARN [main] fetcher.Fetcher: Aborting with 120 hung 
threads.
...
2024-10-02 09:33:18,828 WARN [main] fetcher.Fetcher: Aborting with 12000 queued 
fetch items in 9884 queues (queue feeder still alive).
2024-10-02 09:33:18,828 DEBUG [FetcherThread] fetcher.FetcherThread: 
FetcherThread spin-waiting ...
... (reporting dropped queues)
2024-10-02 09:33:18,903 INFO [main] fetcher.FetchItemQueues: Emptied all 
queues: 9279 queues with 12000 items
2024-10-02 09:33:18,906 DEBUG [FetcherThread] fetcher.FetcherThread: 
FetcherThread spin-waiting ...
2024-10-02 09:33:18,906 INFO [main] mapred.MapTask: Starting flush of map output
2024-10-02 09:33:18,906 INFO [main] mapred.MapTask: Spilling map output
2024-10-02 09:33:18,906 INFO [main] mapred.MapTask: bufstart = 124101175; 
bufend = 177094062; bufvoid = 314572800
2024-10-02 09:33:18,906 INFO [main] mapred.MapTask: kvstart = 
31025288(124101152); kvend = 30983880(123935520); length = 41409/19660800
2024-10-02 09:33:18,907 DEBUG [FetcherThread] fetcher.FetcherThread: 
FetcherThread spin-waiting ...
...
2024-10-02 09:33:19,292 DEBUG [FetcherThread] fetcher.FetcherThread: 
FetcherThread spin-waiting ...
2024-10-02 09:33:19,294 INFO [main] mapred.Merger: Merging 4 sorted segments
2024-10-02 09:33:19,295 INFO [main] mapred.Merger: Down to the last merge-pass, 
with 4 segments left of total size: 979973 bytes
2024-10-02 09:33:19,296 DEBUG [FetcherThread] fetcher.FetcherThread: 
FetcherThread spin-waiting ...
...
2024-10-02 09:33:19,478 DEBUG [FetcherThread] fetcher.FetcherThread: 
FetcherThread spin-waiting ...
2024-10-02 09:33:19,480 DEBUG [QueueFeeder] fetcher.QueueFeeder: -feeding 12000 
input urls ...
2024-10-02 09:33:19,480 DEBUG [FetcherThread] fetcher.FetcherThread: 
FetcherThread spin-waiting ...
2024-10-02 09:33:19,481 DEBUG [FetcherThread] fetcher.FetcherThread: 
FetcherThread spin-waiting ...
2024-10-02 09:33:19,481 ERROR [QueueFeeder] fetcher.QueueFeeder: QueueFeeder 
error reading input, record 89118
java.io.IOException: Stream closed
at 
org.apache.hadoop.io.compress.DecompressorStream.checkStream(DecompressorStream.java:184)
...
at org.apache.nutch.fetcher.QueueFeeder.run(QueueFeeder.java:120)
...
2024-10-02 09:33:19,507 INFO [FetcherThread] fetcher.FetcherThread: 
FetcherThread 1285 has no more work available
2024-10-02 09:33:19,507 INFO [FetcherThread] fetcher.FetcherThread: 
FetcherThread 1285 -finishing thread FetcherThread, activeThreads=104
2024-10-02 09:33:19,507 INFO [main] mapred.Merger: Merging 4 sorted segments
2024-10-02 09:33:19,508 INFO [FetcherThread] fetcher.FetcherThread: 
FetcherThread 319 has no more work available
2024-10-02 09:33:19,508 INFO [FetcherThread] fetcher.FetcherThread: 
FetcherThread 319 -finishing thread FetcherThread, activeThreads=103
2024-10-02 09:33:19,509 INFO [main] mapred.Merger: Down to the last merge-pass, 
with 4 segments left of total size: 959873 bytes
2024-10-02 09:33:19,509 INFO [FetcherThread] fetcher.FetcherThread: 
FetcherThread 1112 has no more work available
2024-10-02 09:33:19,509 INFO [FetcherThread] fetcher.FetcherThread: 
FetcherThread 1112 -finishing thread FetcherThread, activeThreads=102
...
{noformat}

Later on, one of the reducer tasks failed when reading data. This caused the 
job to fail:
{noformat}
2024-10-02 10:37:31,347 INFO mapreduce.Job:  map 100% reduce 92%
2024-10-02 10:37:31,347 INFO mapreduce.Job: Task Id : 
attempt_1727715735602_0060_r_07_1, Status : FAILED
Error: java.lang.ArrayIndexOutOfBound

[jira] [Assigned] (NUTCH-3068) Documentation on Nutch Homepage

2024-10-02 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-3068:
--

Assignee: Sebastian Nagel

> Documentation on Nutch Homepage
> ---
>
> Key: NUTCH-3068
> URL: https://issues.apache.org/jira/browse/NUTCH-3068
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation, website
>Reporter: Hiran Chaudhuri
>    Assignee: Sebastian Nagel
>Priority: Major
>
> On [https://nutch.apache.org/] I can see that Nutch is extensible:
> _Provides intuitive and stable interfaces for popular functions i.e., 
> [Parsers|https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/javadoc/org/apache/nutch/parse/Parser.html],
>  [HTML 
> Filtering|https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/javadoc/org/apache/nutch/parse/HtmlParseFilter.html],
>  
> [Indexing|https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/javadoc/org/apache/nutch/indexer/IndexingFilter.html]
>  and 
> [Scoring|https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/javadoc/org/apache/nutch/scoring/ScoringFilter.html]
>  for custom implementations._
>  
> Here I am missing the protocol plugins.





[jira] [Resolved] (NUTCH-3070) Documentation has outdated links

2024-10-02 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3070.

Resolution: Fixed

Thanks for reporting, [~hiranchaudhuri]!

> Documentation has outdated links
> 
>
> Key: NUTCH-3070
> URL: https://issues.apache.org/jira/browse/NUTCH-3070
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation, wiki
>Reporter: Hiran Chaudhuri
>    Assignee: Sebastian Nagel
>Priority: Major
>
> On the Nutch wiki 
> [https://cwiki.apache.org/confluence/display/NUTCH/AboutPlugins] there are 
> various links pointing to Nutch 1.18 apidoc. All of them are invalid as the 
> documentation is not found at that location.





[jira] [Assigned] (NUTCH-3070) Documentation has outdated links

2024-10-02 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-3070:
--

Assignee: Sebastian Nagel

> Documentation has outdated links
> 
>
> Key: NUTCH-3070
> URL: https://issues.apache.org/jira/browse/NUTCH-3070
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation, wiki
>Reporter: Hiran Chaudhuri
>    Assignee: Sebastian Nagel
>Priority: Major
>
> On the Nutch wiki 
> [https://cwiki.apache.org/confluence/display/NUTCH/AboutPlugins] there are 
> various links pointing to Nutch 1.18 apidoc. All of them are invalid as the 
> documentation is not found at that location.





[jira] [Updated] (NUTCH-3070) Documentation has outdated links

2024-10-02 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3070:
---
Component/s: wiki

> Documentation has outdated links
> 
>
> Key: NUTCH-3070
> URL: https://issues.apache.org/jira/browse/NUTCH-3070
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation, wiki
>Reporter: Hiran Chaudhuri
>Priority: Major
>
> On the Nutch wiki 
> [https://cwiki.apache.org/confluence/display/NUTCH/AboutPlugins] there are 
> various links pointing to Nutch 1.18 apidoc. All of them are invalid as the 
> documentation is not found at that location.





[jira] [Updated] (NUTCH-3069) Update protocol-smb reference

2024-10-02 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3069:
---
Component/s: wiki

> Update protocol-smb reference
> -
>
> Key: NUTCH-3069
> URL: https://issues.apache.org/jira/browse/NUTCH-3069
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation, wiki
>Reporter: Hiran Chaudhuri
>Priority: Major
>
> On the page [https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral] 
> I can see this text:
> _..._
> _[protocol-smb|http://issues.apache.org/jira/browse/NUTCH-427] - Allows Nutch 
> to crawl MS Windows Shares folder._
> _..._
>  
> The link still references an unusable plugin; its Jira ticket has been 
> closed as {_}won't fix{_}.
> For users it would be more useful to admit there currently is no usable 
> plugin, or to link the more up-to-date Jira issue NUTCH-2856.





[jira] [Updated] (NUTCH-3071) Tutorial for Intranet Document Search outdated

2024-10-02 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3071:
---
Component/s: wiki

> Tutorial for Intranet Document Search outdated
> --
>
> Key: NUTCH-3071
> URL: https://issues.apache.org/jira/browse/NUTCH-3071
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation, wiki
>Reporter: Hiran Chaudhuri
>Priority: Minor
>
> On the page 
> [https://cwiki.apache.org/confluence/display/NUTCH/IntranetDocumentSearch] 
> the schema.xml file for Solr is claimed to be in the nutch conf directory. At 
> least in the current master branch that is no longer the case.
> Searching for a schema.xml I found something suitable at 
> src/plugin/indexer-solr/schema.xml, and this file is also mentioned in 
> [https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial#NutchTutorial-SetupSolrforsearch]
> Maybe the IntranetDocumentSearch should simply point to the 
> SetupSolrforsearch chapter.
>  
> But even following the SetupSolrforsearch does not help fully. When running 
> the command
> bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ 
> crawl/segments/20131108063838/ -filter -normalize -deleteGone
> I am getting the message
> INFO o.a.n.i.IndexerOutputFormat [pool-5-thread-1] No IndexWriters activated 
> - check your configuration
>  
> So some step to modify the Nutch config files is missing.





[jira] [Updated] (NUTCH-3056) Injector to support resolving seed URLs

2024-10-02 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3056:
---
Component/s: injector

> Injector to support resolving seed URLs
> ---
>
> Key: NUTCH-3056
> URL: https://issues.apache.org/jira/browse/NUTCH-3056
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3056.patch
>
>
> We have a case where clients submit huge uncurated seed files: the host may 
> no longer exist, or redirect via several hops to elsewhere, the protocol may 
> be incorrect, etc.
> The large crawl itself is not supposed to venture much beyond the seed list, 
> except for regex exceptions listed in 
> {color:#00}db-ignore-external-exemptions{color}. It is also not allowed 
> to jump to other domains/hosts to control the size of the crawl. This means 
> externally redirecting seeds will not be crawled.
> This ticket will add support for a multi-threaded 
> host/domain/protocol/redirecter/resolver in the injector. Seeds not leading 
> to a 200 URL will be discarded. Enabling filtering and normalization is 
> highly recommended for handling the redirects.
> If you have a seed file with 10k+ or millions of records, it is highly 
> recommended to split the input file into chunks so that multiple mappers can 
> get to work. Passing a few million records through one mapper without 
> resolving is no problem, but resolving millions with one mapper, even if 
> threaded, will take many hours.
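The resolver idea described above can be sketched as follows (hypothetical code, not the attached NUTCH-3056 patch; the `status` function stands in for a real redirect-following HTTP client): resolve seeds concurrently on a thread pool and keep only those that end up at an HTTP 200 URL.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.function.Function;

// Hypothetical sketch of a multi-threaded seed resolver. The pluggable
// status function is an assumption standing in for the real HTTP check.
public class SeedResolver {
    public static List<String> keepResolvable(List<String> seeds,
            Function<String, Integer> status, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Resolve every seed concurrently; null marks a discarded seed.
            List<Future<String>> futures = new ArrayList<>();
            for (String seed : seeds) {
                futures.add(pool.submit(
                    () -> status.apply(seed) == 200 ? seed : null));
            }
            List<String> kept = new ArrayList<>();
            for (Future<String> f : futures) {
                String url = f.get();
                if (url != null) kept.add(url);
            }
            return kept;
        } finally {
            pool.shutdown();
        }
    }
}
```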





[jira] [Commented] (NUTCH-3071) Tutorial for Intranet Document Search outdated

2024-10-02 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17886400#comment-17886400
 ] 

Sebastian Nagel commented on NUTCH-3071:


Hi [~hiranchaudhuri], thanks for reporting! We'll update the documentation, 
also for the remaining issues you opened regarding outdated information on the 
Nutch wiki. Note that there is an option for anybody to comment on or even edit 
the wiki pages. The latter requires more administrative work, see 
https://infra.apache.org/cwiki.html.

> Tutorial for Intranet Document Search outdated
> --
>
> Key: NUTCH-3071
> URL: https://issues.apache.org/jira/browse/NUTCH-3071
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Hiran Chaudhuri
>Priority: Minor
>
> On the page 
> [https://cwiki.apache.org/confluence/display/NUTCH/IntranetDocumentSearch] 
> the schema.xml file for Solr is claimed to be in the nutch conf directory. At 
> least in the current master branch that is no longer the case.
> Searching for a schema.xml I found something suitable at 
> src/plugin/indexer-solr/schema.xml, and this file is also mentioned in 
> [https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial#NutchTutorial-SetupSolrforsearch]
> Maybe the IntranetDocumentSearch should simply point to the 
> SetupSolrforsearch chapter.
>  
> But even following the SetupSolrforsearch does not help fully. When running 
> the command
> bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ 
> crawl/segments/20131108063838/ -filter -normalize -deleteGone
> I am getting the message
> INFO o.a.n.i.IndexerOutputFormat [pool-5-thread-1] No IndexWriters activated 
> - check your configuration
>  
> So some step to modify the Nutch config files is missing.





[jira] [Commented] (NUTCH-2856) Implement a protocol-smb plugin based on hierynomus/smbj

2024-10-02 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17886393#comment-17886393
 ] 

Sebastian Nagel commented on NUTCH-2856:


Hi [~hiranchaudhuri], yes and of course. Contributions are always welcome! If 
you feel your code is not yet complete or otherwise ready for use, you might 
open a "draft PR" and add a list of TODOs in the description. I'm sure we can 
support you to get it finally running. And I hope it'll take less time than 
getting NUTCH-2429 into production. :) [~lewismc], anything to add from your 
side?

> Implement a protocol-smb plugin based on hierynomus/smbj
> 
>
> Key: NUTCH-2856
> URL: https://issues.apache.org/jira/browse/NUTCH-2856
> Project: Nutch
>  Issue Type: New Feature
>  Components: external, plugin, protocol
>Reporter: Hiran Chaudhuri
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.21
>
>
> The plugin protocol-smb advertized on 
> [https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral] actually 
> refers to the JCIFS library. According to this library's homepage 
> [https://www.jcifs.org/]:
> _If you're looking for the latest and greatest open source Java SMB library, 
> this is not it. JCIFS has been in maintenance-mode-only for several years and 
> although what it does support works fine (SMB1, NTLMv2, midlc, MSRPC and 
> various utility classes), jCIFS does not support the newer SMB2/3 variants of 
> the SMB protocol which is slowly becoming required (Windows 10 requires 
> SMB2/3). JCIFS only supports SMB1 but Microsoft has deprecated SMB1 in their 
> products. *So if SMB1 is disabled on your network, JCIFS' file related 
> operations will NOT work.*_
> Looking at 
> [https://en.wikipedia.org/wiki/Server_Message_Block#SMB_/_CIFS_/_SMB1:|https://en.wikipedia.org/wiki/Server_Message_Block#SMB_/_CIFS_/_SMB1]
> _Microsoft added SMB1 to the Windows Server 2012 R2 deprecation list in June 
> 2013. Windows Server 2016 and some versions of Windows 10 Fall Creators 
> Update do not have SMB1 installed by default._
> In conclusion, the chances that the SMB1 protocol is installed and/or 
> configured are shrinking rapidly. Therefore a migration towards 
> SMB2/3 is required. Luckily, the JCIFS homepage lists alternatives:
>  * [jcifs-codelibs|https://github.com/codelibs/jcifs]
>  * [jcifs-ng|https://github.com/AgNO3/jcifs-ng]
>  * [smbj|https://github.com/hierynomus/smbj]





[jira] [Resolved] (NUTCH-2812) Methods returning array may expose internal representation

2024-09-17 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2812.

Resolution: Fixed

> Methods returning array may expose internal representation
> --
>
> Key: NUTCH-2812
> URL: https://issues.apache.org/jira/browse/NUTCH-2812
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.17
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.21
>
>
> Returning a reference to a mutable object value stored in one of the object's 
> fields exposes the internal representation of the object.  If instances are 
> accessed by untrusted code, and unchecked changes to the mutable object would 
> compromise security or other important properties, you will need to do 
> something different. Returning a new copy of the object is a better approach in 
> many situations.
> For example org.apache.nutch.fetcher.FetchNode.getOutlinks() may expose 
> internal representation by returning FetchNode.outlinks
> There are 11 such occurrences of this bug in the codebase. 
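The defensive-copy pattern the issue describes can be sketched as follows (a hypothetical class, not the actual FetchNode code):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the defensive-copy fix: return a copy of the
// mutable field instead of a reference to it.
public class FetchNodeSketch {
    private final List<String> outlinks = new ArrayList<>();

    public void addOutlink(String url) {
        outlinks.add(url);
    }

    // Returning a new copy means callers cannot mutate internal state.
    public List<String> getOutlinks() {
        return new ArrayList<>(outlinks);
    }

    public static void main(String[] args) {
        FetchNodeSketch node = new FetchNodeSketch();
        node.addOutlink("https://example.org/");
        node.getOutlinks().clear(); // mutates only the copy
        System.out.println(node.getOutlinks().size()); // prints 1
    }
}
```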





[jira] [Resolved] (NUTCH-1942) Remove TopLevelDomain

2024-09-17 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1942.

Resolution: Done

> Remove TopLevelDomain 
> --
>
> Key: NUTCH-1942
> URL: https://issues.apache.org/jira/browse/NUTCH-1942
> Project: Nutch
>  Issue Type: Task
>Reporter: Julien Nioche
>Priority: Minor
>  Labels: crawler-commons, newbie
> Fix For: 1.21
>
>
> We should leverage the domain related utilities from crawler-commons instead 
> of duplicating them in the `org.apache.nutch.util.domain` package. For 
> instance we could deprecate TopLevelDomain and call the corresponding class 
> in CC instead. The resources in CC are more up-to-date and it is less code to 
> maintain.
> This would be a good task for someone willing to get to know the Nutch 
> codebase better and impress us all with the extent of his/her skills.





[jira] [Resolved] (NUTCH-1806) Delegate processing of URL domains to crawler commons

2024-09-17 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1806.

Resolution: Implemented

Thanks, everybody!

> Delegate processing of URL domains to crawler commons
> -
>
> Key: NUTCH-1806
> URL: https://issues.apache.org/jira/browse/NUTCH-1806
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.8
>Reporter: Julien Nioche
>Priority: Major
>  Labels: crawler-commons
> Fix For: 1.21
>
>
> We have code in src/java/org/apache/nutch/util/domain and a resource file 
> conf/domain-suffixes.xml to handle URL domains. This is used mostly from 
> URLUtil.getDomainName.
> The resource file is not necessarily up to date and since crawler commons has 
> a similar functionality we should use it instead of having to maintain our 
> own resources.





[jira] [Resolved] (NUTCH-3058) Fetcher: counter for hung threads

2024-09-16 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3058.

Resolution: Implemented

> Fetcher: counter for hung threads
> -
>
> Key: NUTCH-3058
> URL: https://issues.apache.org/jira/browse/NUTCH-3058
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.20
>    Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> The Fetcher class defines a "hard" timeout as 50% of the MapReduce 
> task timeout, see {{mapreduce.task.timeout}} and 
> {{fetcher.threads.timeout.divisor}}. If fetcher threads are running but 
> make no progress during the timeout period (in terms of newly started 
> fetch items), Fetcher is shut down to avoid reaching the task timeout, 
> which would fail the fetcher job. The "hung threads" are logged together with 
> the URL being fetched and (at DEBUG level) the Java stack trace.
> In addition to logging, a job counter should indicate the number of hung 
> threads. This would allow seeing at the job level whether there are issues 
> with hung threads. Tracing the issues still requires looking into the 
> Hadoop task logs.
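The bookkeeping behind such a counter can be sketched like this (illustrative names, not Fetcher's actual fields): each thread records when it last started a fetch item, and threads idle longer than the timeout are counted as hung.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: count threads without progress within the timeout
// period; the resulting count would be reported as a job counter.
public class HungThreadCounter {
    private final Map<String, Long> lastProgressMs = new ConcurrentHashMap<>();
    private final long timeoutMs;

    public HungThreadCounter(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Called whenever a fetcher thread starts a new fetch item.
    public void recordProgress(String threadName, long nowMs) {
        lastProgressMs.put(threadName, nowMs);
    }

    // Number of threads with no recorded progress during the timeout period.
    public long countHungThreads(long nowMs) {
        return lastProgressMs.values().stream()
                .filter(last -> nowMs - last > timeoutMs)
                .count();
    }

    public static void main(String[] args) {
        HungThreadCounter counter = new HungThreadCounter(1000);
        counter.recordProgress("FetcherThread-1", 0);
        counter.recordProgress("FetcherThread-2", 5000);
        System.out.println(counter.countHungThreads(6000)); // prints 1
    }
}
```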





[jira] [Commented] (NUTCH-3059) Generator: selector job does not count reduce output records

2024-09-14 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881792#comment-17881792
 ] 

Sebastian Nagel commented on NUTCH-3059:


Note: the above test was run in pseudo-distributed mode because in local mode 
only one partition per segment is generated. The counters are correct, as shown 
by comparison with segment counts:
{noformat}
$> nutch readseg -list -dir
NAME            GENERATED  FETCHER START  FETCHER END  FETCHED  PARSED
20240914162841  1000       ?              ?            ?        ?
20240914162906  399        ?              ?            ?        ?
{noformat}

> Generator: selector job does not count reduce output records
> 
>
> Key: NUTCH-3059
> URL: https://issues.apache.org/jira/browse/NUTCH-3059
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.21
>
>
> The selector step (job) of the Generator does not count the reduce output 
> records, i.e. it shows the count "0":
> {noformat}
> 2024-06-05 13:57:09,299 INFO o.a.n.c.Generator [main] Generator: starting
> 2024-06-05 13:57:09,299 INFO o.a.n.c.Generator [main] Generator: selecting 
> best-scoring urls due for fetch.
> ...
>          Map-Reduce Framework
>                 Map input records=6
>                 Map output records=6
>                 ...
>                 Combine input records=0
>                 Combine output records=0
>                 Reduce input groups=1
>                 Reduce shuffle bytes=594
>                 Reduce input records=6
>                 Reduce output records=0
>                 Spilled Records=12
>                 ...
> {noformat}
> Not a big issue, but we should investigate why it happens. The other counters 
> seem to work properly, and the partitioner job does show the reduce output 
> records. The issue is observed in both local and distributed mode.





[jira] [Commented] (NUTCH-3059) Generator: selector job does not count reduce output records

2024-09-14 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881791#comment-17881791
 ] 

Sebastian Nagel commented on NUTCH-3059:


Ok, found the reason: it's because of 
[MultipleOutputs|https://hadoop.apache.org/docs/current/api/index.html?org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html]
 used by Generator to write one output per segment and fetch list partition.

We could set
{code}
MultipleOutputs.setCountersEnabled(job, true);
{code}
which would add one counter for each segment:
{noformat}
$> nutch generate crawldb segments -topN 1000 -numFetchers 3 -maxNumSegments 2
org.apache.hadoop.mapreduce.lib.output.MultipleOutputs
fetchlist-1/part=1000
fetchlist-2/part=399
{noformat}
and a segments/ directory tree (after the partition job):
{noformat}
segments/
 |---20240914162841/
 |   `---crawl_generate/
 |       |---part-r-0
 |       |---part-r-1
 |       `---part-r-2
 `---20240914162906/
     `---crawl_generate/
         |---part-r-0
         |---part-r-1
         `---part-r-2
{noformat}

Any thoughts or objections? Otherwise, I would open a PR...

> Generator: selector job does not count reduce output records
> 
>
> Key: NUTCH-3059
> URL: https://issues.apache.org/jira/browse/NUTCH-3059
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.21
>
>
> The selector step (job) of the Generator does not count the reduce output 
> records, i.e. it shows the count "0":
> {noformat}
> 2024-06-05 13:57:09,299 INFO o.a.n.c.Generator [main] Generator: starting
> 2024-06-05 13:57:09,299 INFO o.a.n.c.Generator [main] Generator: selecting 
> best-scoring urls due for fetch.
> ...
>          Map-Reduce Framework
>                 Map input records=6
>                 Map output records=6
>                 ...
>                 Combine input records=0
>                 Combine output records=0
>                 Reduce input groups=1
>                 Reduce shuffle bytes=594
>                 Reduce input records=6
>                 Reduce output records=0
>                 Spilled Records=12
>                 ...
> {noformat}
> Not a big issue, but we should investigate why it happens. The other counters 
> seem to work properly, and the partitioner job does show the reduce output 
> records. The issue is observed in both local and distributed mode.





[jira] [Resolved] (NUTCH-3061) URL filters to log name of the rule file rules are read from

2024-09-13 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3061.

Resolution: Implemented

> URL filters to log name of the rule file rules are read from
> 
>
> Key: NUTCH-3061
> URL: https://issues.apache.org/jira/browse/NUTCH-3061
> Project: Nutch
>  Issue Type: Improvement
>  Components: urlfilter
>Affects Versions: 1.21
>    Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.21
>
>
> Some of the URL filters already log the name of the rule file from which 
> rules are read. This is helpful if a custom rule file is defined in the 
> configuration. The following do not yet log the rules file name: 
> urlfilter-regex, urlfilter-automaton, urlfilter-fast.





[jira] [Resolved] (NUTCH-3062) protocol-okhttp: optionally record HTTP and SSL/TLS versions

2024-09-13 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3062.

Resolution: Implemented

> protocol-okhttp: optionally record HTTP and SSL/TLS versions
> 
>
> Key: NUTCH-3062
> URL: https://issues.apache.org/jira/browse/NUTCH-3062
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Affects Versions: 1.21
>    Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.21
>
>
> Background: WARC headers may record the HTTP and SSL/TLS versions (see 
> [iipc/warc-specifications#42|https://github.com/iipc/warc-specifications/issues/42])
>  and the SSL/TLS cipher suites (see 
> [iipc/warc-specifications#86|https://github.com/iipc/warc-specifications/issues/86]).
>  This issue is about tracking the necessary information, for now, in 
> protocol-okhttp.





[jira] [Resolved] (NUTCH-3065) Format changelog as Markdown

2024-09-13 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3065.

Resolution: Implemented

> Format changelog as Markdown
> 
>
> Key: NUTCH-3065
> URL: https://issues.apache.org/jira/browse/NUTCH-3065
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 1.21
>    Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> Together with the release of 1.20 the changelog was renamed from CHANGES.txt 
> to CHANGES.md, i.e. it should also follow the Markdown syntax.
> - Markdown allows for links and is displayed nicely on many web-based Git 
> front-ends
> - see the [1.20 
> CHANGES.md|https://github.com/apache/nutch/blob/branch-1.20/CHANGES.md] on 
> GitHub
> - see also the discussion in the [1.20 VOTE 
> thread|https://lists.apache.org/thread/vc2rdwqbnq47nl23d88gth1dpnsgfccr]
> This issue addresses two points:
> - add the 1.20 release notes as Markdown to CHANGES.md
> - reformat previous release notes as far as possible





[jira] [Resolved] (NUTCH-3066) Protocol plugin unit tests fail randomly

2024-09-13 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3066.

Resolution: Fixed

> Protocol plugin unit tests fail randomly
> 
>
> Key: NUTCH-3066
> URL: https://issues.apache.org/jira/browse/NUTCH-3066
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol, test
>Affects Versions: 1.20
>    Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> The HTTP protocol plugin unit tests may fail at random. The reasons are:
> - the unit tests run the test web server delivering test pages using the same 
> port (inherited from AbstractHttpProtocolPluginTest)
> - the plugin unit tests are executed in parallel (two concurrent threads)
> Occasionally, two tests try to launch a web server listening on the same 
> port. This causes a failure, e.g. in a [GitHub 
> workflow|https://github.com/apache/nutch/actions/runs/10728348673/job/29752702091?pr=823#step:4:7667]:
> {noformat}
> [junit] Tests run: 14, Failures: 1, Errors: 0, Skipped: 4, Time elapsed: 
> 4.735 sec
> [junit] Test org.apache.nutch.protocol.okhttp.TestBadServerResponses FAILED
> {noformat}
> The error message in 
> (TEST-org.apache.nutch.protocol.http.TestBadServerResponses.txt, from a local 
> test run):
> {noformat}
> 2024-09-06 08:36:32,549 INFO o.a.n.p.AbstractHttpProtocolPluginTest 
> [Thread-3] Socket on port 47505 closed: Address already in use (Bind failed)
> 2024-09-06 08:36:32,550 INFO o.a.n.p.AbstractHttpProtocolPluginTest 
> [Thread-2] Socket on port 47505 closed: Socket closed
> 2024-09-06 08:36:32,599 INFO o.a.n.p.AbstractHttpProtocolPluginTest [main] 
> Fetching http://127.0.0.1:47505/
> 2024-09-06 08:36:32,600 ERROR o.a.n.p.h.Http [main] Failed to get protocol 
> output
> java.net.ConnectException: Connection refused (Connection refused)
> {noformat}
> Possible solutions:
> 1. do not run plugin unit tests in parallel.
> -- Note: the parallelism does not save a lot of time. A test run on my 
> laptop: 6'19'' (2 threads, failed) vs. 6'55'' (1 thread, success)
> 2. override the port in each unit test and
> -- ensure that a unique port number is used. Note: manually assigning 
> unique numbers is difficult to maintain when new tests are added.
> -- or choose a random port number (making collisions unlikely)
> 3. try another port if one is already in use
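Solution 2's random-port variant can be made collision-free by letting the OS assign an ephemeral port, as in this sketch (the class name is illustrative; the actual test base class differs):

```java
import java.io.IOException;
import java.net.ServerSocket;

// Sketch: binding to port 0 asks the OS for any currently free port,
// so concurrently running test JVMs cannot request the same one.
public class FreePortSketch {
    public static int findFreePort() throws IOException {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        int port = findFreePort();
        System.out.println("Test server could listen on port " + port);
    }
}
```

Note there is still a small race between closing the probe socket and the test server binding the port, which is why retrying (solution 3) remains a useful fallback.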





[jira] [Commented] (NUTCH-1806) Delegate processing of URL domains to crawler commons

2024-09-11 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880958#comment-17880958
 ] 

Sebastian Nagel commented on NUTCH-1806:


> it seems odd to return an empty String instead of null when nothing is found

Ok, agreed. Changed the behavior back to the previous one, see [commit 
40881e8b|https://github.com/apache/nutch/pull/816/commits/40881e8b755e24d78a60689bd818058daba1a6fc].

> Delegate processing of URL domains to crawler commons
> -
>
> Key: NUTCH-1806
> URL: https://issues.apache.org/jira/browse/NUTCH-1806
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.8
>Reporter: Julien Nioche
>Priority: Major
>  Labels: crawler-commons
> Fix For: 1.21
>
>
> We have code in src/java/org/apache/nutch/util/domain and a resource file 
> conf/domain-suffixes.xml to handle URL domains. This is used mostly from 
> URLUtil.getDomainName.
> The resource file is not necessarily up to date and since crawler commons has 
> a similar functionality we should use it instead of having to maintain our 
> own resources.





[jira] [Created] (NUTCH-3067) Improve performance of FetchItemQueues if error state is preserved

2024-09-07 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3067:
--

 Summary: Improve performance of FetchItemQueues if error state is 
preserved
 Key: NUTCH-3067
 URL: https://issues.apache.org/jira/browse/NUTCH-3067
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.20
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.21
 Attachments: Screenshot_20240905_101623_fetcher_tasks_many_queues.png, 
fetcher.map.2024073925.925750.flamegraph.html

In certain cases the error state of a fetch queue needs to be
preserved, even if the queue is (currently) empty, because there might
still be URLs in the fetcher input not yet read by the QueueFeeder,
see NUTCH-2947. Keeping the queue together with its state is necessary:

- to skip queues together with all items queued now or to be queued
  later by the QueueFeeder, if a queue exceeds the maximum configured
  number of exceptions (NUTCH-769). This is mostly a performance feature,
  but with implications for politeness because also HTTP 403 Forbidden
  (and similar) are counted as "exceptions".

- to implement an exponential backoff which slows down the fetching from sites
  responding with repeated "exceptions" (NUTCH-2946).

However, there is a drawback when all "stateful" queues are preserved
until the QueueFeeder has finished reading input fetch lists: Nutch's
fetch queue implementation becomes slow if there are too many queues.
This situation / issue was observed in the first cycle of a crawl
where only the homepages of millions of sites were fetched:
- about 1 million homepages per fetcher task
- about 25% of the homepage URLs caused exceptions - the fetch list was not 
filtered beforehand to check whether a site is reachable and responding
- consequently, after a certain amount of time (3-4 hours) 250k queues per task 
were "stateful" and preserved until the fetch list input was entirely read by 
the QueueFeeder
- with too many queues and most of them empty (no URLs) the operations on the 
queues become slow and fetching almost stalls (see screenshot)
  - many queues but few URLs queued (250k vs. 25)
  - most fetcher threads (190 out of 240) waiting for the lock on one of the 
synchronized methods of FetchItemQueues
  - the QueueFeeder is also affected by the lock, which explains why only a few 
URLs are queued

Important notes: this is not an issue
- if no error state is preserved, that is if {{fetcher.max.exceptions.per.queue 
== -1}} and {{fetcher.exceptions.per.queue.delay == 0.0}}
- or if the crawl isn't too "broad" in terms of the number of different hosts 
(domains or IPs, depending on {{fetcher.queue.mode}})

Possible solutions:

1. do not keep every stateful queue: drop queues which have a low exception 
count after a configurable amount of time. If a second URL from the same 
host/domain/IP is fetched after a considerably long time span (e.g. 30 minutes), 
the effect on performance and politeness should be negligible.

2. review the implementation of FetchItemQueues and the locking (synchronized 
methods)

3. at least, try to prioritize QueueFeeder, for example by a method which adds 
multiple fetch items within one synchronized call
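Solution 3 can be sketched as follows (an illustrative simplification, not the actual FetchItemQueues, which keys queues by host/domain/IP): adding a whole batch under one synchronized call means the QueueFeeder acquires the contended lock once per batch instead of once per item.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Sketch: amortizing lock acquisition for the feeder thread.
public class BatchedQueueSketch<T> {
    private final Queue<T> items = new ArrayDeque<>();

    // One lock acquisition per item: the feeder competes with every
    // fetcher thread for each single add.
    public synchronized void add(T item) {
        items.add(item);
    }

    // One lock acquisition per batch: reduces contention on the shared lock.
    public synchronized void addAll(List<T> batch) {
        items.addAll(batch);
    }

    public synchronized T poll() {
        return items.poll();
    }

    public synchronized int size() {
        return items.size();
    }

    public static void main(String[] args) {
        BatchedQueueSketch<String> queue = new BatchedQueueSketch<>();
        queue.addAll(List.of("smb://host/a", "smb://host/b"));
        System.out.println(queue.size()); // prints 2
    }
}
```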


Details and data:

Screenshot of the Fetcher map task status in the Hadoop YARN Web UI (attached)

Counts of the top (deepest) line in the stack traces of all Fetcher threads:
{noformat}
120 at 
org.apache.nutch.fetcher.FetchItemQueues.getFetchItem(FetchItemQueues.java:177)
49  at 
org.apache.nutch.fetcher.FetchItemQueues.checkExceptionThreshold(FetchItemQueues.java:281)
21  at 
org.apache.nutch.fetcher.FetchItemQueues.getFetchItemQueue(FetchItemQueues.java:166)
19  at 
java.net.PlainSocketImpl.socketConnect(java.base@11.0.24/Native Method)
18  at 
java.net.SocketInputStream.socketRead0(java.base@11.0.24/Native Method)
6   at java.lang.Object.wait(java.base@11.0.24/Native Method)  # 
waiting for HTTP/2 stream
4   at java.lang.Thread.sleep(java.base@11.0.24/Native Method)
2   at 
java.net.Inet4AddressImpl.lookupAllHostAddr(java.base@11.0.24/Native Method)
1   at 
java.util.Collections$SynchronizedCollection.size(java.base@11.0.24/Collections.java:2017)
{noformat}

Full stack traces (three examples):
{noformat}
"FetcherThread" #38 daemon prio=5 os_prio=0 cpu=43743.17ms elapsed=15890.29s 
tid=0x752967fff800 nid=0x83a3c waiting for monitor entry  
[0x75292fcf9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.nutch.fetcher.FetchItemQueues.getFetchItem(FetchItemQueues.java:177)
- waiting to lock <0x00066894b9d8> (a 
org.apache.nutch.fetcher.FetchItemQueues)
at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:301)

[jira] [Commented] (NUTCH-1806) Delegate processing of URL domains to crawler commons

2024-09-07 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880036#comment-17880036
 ] 

Sebastian Nagel commented on NUTCH-1806:


Any comments on this? It's an important improvement, imho. But also a 
significant change.

> Delegate processing of URL domains to crawler commons
> -
>
> Key: NUTCH-1806
> URL: https://issues.apache.org/jira/browse/NUTCH-1806
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.8
>Reporter: Julien Nioche
>Priority: Major
>  Labels: crawler-commons
> Fix For: 1.21
>
>
> We have code in src/java/org/apache/nutch/util/domain and a resource file 
> conf/domain-suffixes.xml to handle URL domains. This is used mostly from 
> URLUtil.getDomainName.
> The resource file is not necessarily up to date and since crawler commons has 
> a similar functionality we should use it instead of having to maintain our 
> own resources.





[jira] [Resolved] (NUTCH-3063) Support for "addBinaryContent" from REST API

2024-09-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3063.

Resolution: Implemented

Committed in 
[ac03cf1|https://github.com/apache/nutch/commit/ac03cf1646f5af152daeb9f0bef3fec2b51739c2].
 Thanks, [~igiguere]!

> Support for "addBinaryContent" from REST API
> 
>
> Key: NUTCH-3063
> URL: https://issues.apache.org/jira/browse/NUTCH-3063
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Affects Versions: 1.20
>Reporter: Isabelle Giguere
>Assignee: Isabelle Giguere
>Priority: Major
> Fix For: 1.21
>
> Attachments: NUTCH-3063.patch
>
>
> NUTCH-1785 added the possibility of requesting the raw binary content, with 
> arg `addBinaryContent`, and possibly encode it as `base64`.
> This functionality should also be supported from the REST API.
> Integrating Nutch using the CLI is out of the question for some applications, 
> and at the same time some may need the raw content for further processing.
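The base64 step mentioned above boils down to standard-library encoding; this sketch is illustrative, not the actual IndexingJob code path:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch: encode raw fetched bytes for inclusion in an index document
// when `addBinaryContent` with `base64` is requested.
public class BinaryContentSketch {
    public static String encodeBase64(byte[] rawContent) {
        return Base64.getEncoder().encodeToString(rawContent);
    }

    public static void main(String[] args) {
        byte[] content = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(encodeBase64(content));
    }
}
```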





[jira] [Commented] (NUTCH-3063) Support for "addBinaryContent" from REST API

2024-09-06 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879964#comment-17879964
 ] 

Sebastian Nagel commented on NUTCH-3063:


+1 looks good. And definitely makes sense. Good catch! Going to apply the patch 
and commit...

> Support for "addBinaryContent" from REST API
> 
>
> Key: NUTCH-3063
> URL: https://issues.apache.org/jira/browse/NUTCH-3063
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Affects Versions: 1.20
>Reporter: Isabelle Giguere
>Assignee: Isabelle Giguere
>Priority: Major
> Fix For: 1.21
>
> Attachments: NUTCH-3063.patch
>
>
> NUTCH-1785 added the possibility of requesting the raw binary content, with 
> arg `addBinaryContent`, and possibly encode it as `base64`.
> This functionality should also be supported from the REST API.
> Integrating Nutch using the CLI is out of the question for some applications, 
> and at the same time some may need the raw content for further processing.





[jira] [Commented] (NUTCH-3065) Format changelog as Markdown

2024-09-05 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879666#comment-17879666
 ] 

Sebastian Nagel commented on NUTCH-3065:


PR in progress: the [reformatted 
changelog|https://github.com/sebastian-nagel/nutch/blob/NUTCH-3065-changelog-markdown/CHANGES.md]

> Format changelog as Markdown
> 
>
> Key: NUTCH-3065
> URL: https://issues.apache.org/jira/browse/NUTCH-3065
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 1.21
>    Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> Together with the release of 1.20 the changelog was renamed from CHANGES.txt 
> to CHANGES.md, i.e. it should also follow the Markdown syntax.
> - Markdown allows for links and is displayed nicely on many web-based Git 
> front-ends
> - see the [1.20 
> CHANGES.md|https://github.com/apache/nutch/blob/branch-1.20/CHANGES.md] on 
> GitHub
> - see also the discussion in the [1.20 VOTE 
> thread|https://lists.apache.org/thread/vc2rdwqbnq47nl23d88gth1dpnsgfccr]
> This issue addresses two points:
> - add the 1.20 release notes as Markdown to CHANGES.md
> - reformat previous release notes as far as possible





[jira] [Assigned] (NUTCH-3065) Format changelog as Markdown

2024-09-05 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-3065:
--

Assignee: Sebastian Nagel

> Format changelog as Markdown
> 
>
> Key: NUTCH-3065
> URL: https://issues.apache.org/jira/browse/NUTCH-3065
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 1.21
>    Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> Together with the release of 1.20 the changelog was renamed from CHANGES.txt 
> to CHANGES.md, i.e. it should also follow the Markdown syntax.
> - Markdown allows for links and is displayed nicely on many web-based Git 
> front-ends
> - see the [1.20 
> CHANGES.md|https://github.com/apache/nutch/blob/branch-1.20/CHANGES.md] on 
> GitHub
> - see also the discussion in the [1.20 VOTE 
> thread|https://lists.apache.org/thread/vc2rdwqbnq47nl23d88gth1dpnsgfccr]
> This issue addresses two points:
> - add the 1.20 release notes as Markdown to CHANGES.md
> - reformat previous release notes as far as possible





[jira] [Created] (NUTCH-3065) Format changelog as Markdown

2024-09-05 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3065:
--

 Summary: Format changelog as Markdown
 Key: NUTCH-3065
 URL: https://issues.apache.org/jira/browse/NUTCH-3065
 Project: Nutch
  Issue Type: Improvement
  Components: documentation
Affects Versions: 1.21
Reporter: Sebastian Nagel
 Fix For: 1.21


Together with the release of 1.20 the changelog was renamed from CHANGES.txt to 
CHANGES.md, i.e. it should also follow the Markdown syntax.
- Markdown allows for links and is displayed nicely on many web-based Git 
front-ends
- see the [1.20 
CHANGES.md|https://github.com/apache/nutch/blob/branch-1.20/CHANGES.md] on 
GitHub
- see also the discussion in the [1.20 VOTE 
thread|https://lists.apache.org/thread/vc2rdwqbnq47nl23d88gth1dpnsgfccr]

This issue addresses two points:
- add the 1.20 release notes as Markdown to CHANGES.md
- reformat previous release notes as far as possible






[jira] [Updated] (NUTCH-3060) Javadoc link broken on website

2024-08-01 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3060:
---
Description: The link to the 1.20 Javadocs on 
[https://nutch.apache.org/documentation/javadoc/] is broken: the target 
([https://nutch.apache.org/documentation/javadoc/api/index.html]) does not 
exist.  (was: The link to the 1.20 Javadocs on 
[https://nutch.apache.org/documentation/javadoc/] is broken: the target 
([https://nutch.apache.org/documentation/javadoc/api/index.html)] does not 
exist.)

> Javadoc link broken on website
> --
>
> Key: NUTCH-3060
> URL: https://issues.apache.org/jira/browse/NUTCH-3060
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation, website
>Affects Versions: 1.20
>    Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> The link to the 1.20 Javadocs on 
> [https://nutch.apache.org/documentation/javadoc/] is broken: the target 
> ([https://nutch.apache.org/documentation/javadoc/api/index.html]) does not 
> exist.





[jira] [Commented] (NUTCH-3060) Javadoc link broken on website

2024-08-01 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870291#comment-17870291
 ] 

Sebastian Nagel commented on NUTCH-3060:


The missing Javadocs are now placed on staging: 
https://nutch.staged.apache.org/documentation/javadoc/api/index.html - after 
some verification I'll place them also on "production".

I was able to reproduce the problem with Hugo 0.123.7: HTML files are not 
served (or copied to public) if they do not include a front matter. Cf. 
[hugo#12008](https://github.com/gohugoio/hugo/pull/12008) and 
https://gohugo.io/content-management/formats/. The latter states: "Regardless 
of content format, all content must have front matter, preferably including 
both title and date."
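
For illustration (hypothetical file name and values, not the actual Javadoc pages): an HTML content file that Hugo should publish needs a leading front matter block, e.g.:

```html
---
title: "Nutch 1.20 API"
date: 2024-08-01
---
<!-- without the front matter block above, Hugo >= 0.123 does not
     copy this HTML file into the generated public/ directory -->
<!DOCTYPE html>
<html>
  <head><title>Nutch 1.20 API</title></head>
  <body>...</body>
</html>
```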

I'll keep this issue open until we have a solution or a *documented* 
work-around.

> Javadoc link broken on website
> --
>
> Key: NUTCH-3060
> URL: https://issues.apache.org/jira/browse/NUTCH-3060
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation, website
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> The link to the 1.20 Javadocs on 
> [https://nutch.apache.org/documentation/javadoc/] is broken: the target 
> ([https://nutch.apache.org/documentation/javadoc/api/index.html)] does not 
> exist.





[jira] [Created] (NUTCH-3062) protocol-okhttp: optionally record HTTP and SSL/TLS versions

2024-07-09 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3062:
--

 Summary: protocol-okhttp: optionally record HTTP and SSL/TLS 
versions
 Key: NUTCH-3062
 URL: https://issues.apache.org/jira/browse/NUTCH-3062
 Project: Nutch
  Issue Type: Improvement
  Components: protocol
Affects Versions: 1.21
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.21


Background: WARC headers may record the HTTP and SSL/TLS versions (see 
[iipc/warc-specifications#42|https://github.com/iipc/warc-specifications/issues/42])
 and the SSL/TLS cipher suites (see 
[iipc/warc-specifications#86|https://github.com/iipc/warc-specifications/issues/86]).
 This issue tracks recording the necessary information, for now in 
protocol-okhttp.
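
For illustration, a WARC record header carrying this information might look like the following (field names as proposed in the linked warc-specifications issues; the exact names and values here are hypothetical):

```
WARC/1.1
WARC-Type: response
WARC-Target-URI: https://example.com/
WARC-Protocol: h2
WARC-Cipher-Suite: TLS_AES_128_GCM_SHA256
```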





[jira] [Created] (NUTCH-3061) URL filters to log name of the rule file rules are read from

2024-07-09 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3061:
--

 Summary: URL filters to log name of the rule file rules are read 
from
 Key: NUTCH-3061
 URL: https://issues.apache.org/jira/browse/NUTCH-3061
 Project: Nutch
  Issue Type: Improvement
  Components: urlfilter
Affects Versions: 1.21
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.21


Some of the URL filters already log the name of the rule file from which rules 
are read. This is helpful if a custom rule file is defined in the configuration. 
The following do not yet log the rule file name: urlfilter-regex, 
urlfilter-automaton, urlfilter-fast.





[jira] [Created] (NUTCH-3060) Javadoc link broken on website

2024-06-28 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3060:
--

 Summary: Javadoc link broken on website
 Key: NUTCH-3060
 URL: https://issues.apache.org/jira/browse/NUTCH-3060
 Project: Nutch
  Issue Type: Bug
  Components: documentation, website
Affects Versions: 1.20
Reporter: Sebastian Nagel
 Fix For: 1.20


The link to the 1.20 Javadocs on 
[https://nutch.apache.org/documentation/javadoc/] is broken: the target 
([https://nutch.apache.org/documentation/javadoc/api/index.html)] does not 
exist.





[jira] [Updated] (NUTCH-3060) Javadoc link broken on website

2024-06-28 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3060:
---
Fix Version/s: 1.21
   (was: 1.20)

> Javadoc link broken on website
> --
>
> Key: NUTCH-3060
> URL: https://issues.apache.org/jira/browse/NUTCH-3060
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation, website
>Affects Versions: 1.20
>    Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> The link to the 1.20 Javadocs on 
> [https://nutch.apache.org/documentation/javadoc/] is broken: the target 
> ([https://nutch.apache.org/documentation/javadoc/api/index.html)] does not 
> exist.





[jira] [Created] (NUTCH-3059) Generator: selector job does not count reduce output records

2024-06-05 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3059:
--

 Summary: Generator: selector job does not count reduce output 
records
 Key: NUTCH-3059
 URL: https://issues.apache.org/jira/browse/NUTCH-3059
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.20
Reporter: Sebastian Nagel
 Fix For: 1.21


The selector step (job) of the Generator does not count the reduce output 
records, i.e. it shows the count "0":
{noformat}
2024-06-05 13:57:09,299 INFO o.a.n.c.Generator [main] Generator: starting

2024-06-05 13:57:09,299 INFO o.a.n.c.Generator [main] Generator: selecting 
best-scoring urls due for fetch.
...
         Map-Reduce Framework
                Map input records=6
                Map output records=6
                ...
                Combine input records=0
                Combine output records=0
                Reduce input groups=1
                Reduce shuffle bytes=594
                Reduce input records=6
                Reduce output records=0
                Spilled Records=12
                ...
{noformat}
Not a big issue, but we should investigate why this happens. The other counters 
seem to work properly, and the partitioner job does show the reduce output 
records. The issue is observed in both local and distributed mode.





[jira] [Created] (NUTCH-3058) Fetcher: counter for hung threads

2024-06-05 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3058:
--

 Summary: Fetcher: counter for hung threads
 Key: NUTCH-3058
 URL: https://issues.apache.org/jira/browse/NUTCH-3058
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.20
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.21


The Fetcher class defines a "hard" timeout as 50% of the MapReduce task 
timeout, see {{mapreduce.task.timeout}} and 
{{fetcher.threads.timeout.divisor}}. If fetcher threads are running but make no 
progress during the timeout period (in terms of newly started fetch items), the 
Fetcher is shut down to prevent the task timeout from being reached and the 
fetcher job from failing. The "hung threads" are logged together with the URL 
being fetched and (at DEBUG level) the Java stack trace.

In addition to logging, a job counter should indicate the number of hung 
threads. This would make it visible at the job level whether there are issues 
with hung threads. Tracing the issues still requires looking into the Hadoop 
task logs.
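
As a rough illustration of the proposal (a simplified stand-alone model, not Nutch's actual Fetcher or the Hadoop counter API), the shutdown path that logs a hung thread could additionally bump a counter that is later exported as a job counter:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model: count hung fetcher threads in addition to logging them.
// Class and method names are hypothetical; in Nutch the final count would be
// written to a Hadoop job counter so it is visible at the job level.
public class HungThreadCounter {
    static final AtomicInteger hungThreads = new AtomicInteger();

    static void reportHungThread(Thread t, String url) {
        // existing behaviour: log the thread and the URL being fetched
        System.err.println("Thread hung on " + url + ": " + t.getName());
        // proposed addition: increment a counter alongside the log message
        hungThreads.incrementAndGet();
    }

    public static void main(String[] args) {
        reportHungThread(Thread.currentThread(), "http://example.com/slow");
        System.out.println("hung threads: " + hungThreads.get());
    }
}
```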





[jira] [Resolved] (NUTCH-3055) README: fix Github "hub" commands

2024-05-28 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3055.

Resolution: Fixed

> README: fix Github "hub" commands
> -
>
> Key: NUTCH-3055
> URL: https://issues.apache.org/jira/browse/NUTCH-3055
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.21
>
>
> The [README.md|https://github.com/apache/nutch/blob/master/README.md] 
> contains [Github hub|https://hub.github.com/] commands, but with "git" as the 
> command (executable) name, perhaps relying on an alias or some other magic. However, if 
> hub isn't installed, these commands fail with {{git: 'pull-request' is not a 
> git command. See 'git --help'.}} or similar.
> We should use the command "hub" instead.





[jira] [Resolved] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails

2024-05-28 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3044.

Resolution: Fixed

> Generator: NPE when extracting the host part of a URL fails
> ---
>
> Key: NUTCH-3044
> URL: https://issues.apache.org/jira/browse/NUTCH-3044
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.20
>    Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.21
>
>
> When extracting the host part of a URL fails, the Generator job fails because 
> of an NPE in the SelectorReducer. This issue is reproducible if the CrawlDb 
> contains a malformed URL, for example, a URL with an unsupported scheme 
> (smb://).
> {noformat}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:439)
>   at 
> org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:300)
> {noformat}





[jira] [Resolved] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-05-14 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3043.

Resolution: Implemented

> Generator: count URLs rejected by URL filters
> -
>
> Key: NUTCH-3043
> URL: https://issues.apache.org/jira/browse/NUTCH-3043
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.20
>    Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.21
>
>
> Generator already counts URLs rejected by the (re)fetch scheduler, by fetch 
> interval or status. It should also count the number of URLs rejected by URL 
> filters.
> See also [Generator 
> metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator].





[jira] [Resolved] (NUTCH-3039) Failure to handle ftp:// URLs

2024-05-14 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3039.

Resolution: Fixed

> Failure to handle ftp:// URLs
> -
>
> Key: NUTCH-3039
> URL: https://issues.apache.org/jira/browse/NUTCH-3039
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.19
>    Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> Nutch fails to handle ftp:// URLs:
> - URLNormalizerBasic returns the empty string because creating the URL 
> instance fails with a MalformedURLException:
>   {noformat}
> echo "ftp://ftp.example.com/path/file.txt" \
>   | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat}
> - fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due 
> to a MalformedURLException:
>   {noformat}
> bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \
>"ftp://ftp.example.com/path/file.txt"
> ...
> Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: 
> java.net.MalformedURLException
> at 
> org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)
> ...{noformat}
> The issue is caused by NUTCH-2429:
> - we do not provide a dedicated URL stream handler for ftp URLs
> - but also do not pass ftp:// URLs to the standard JVM handler





[jira] [Created] (NUTCH-3055) README: fix Github "hub" commands

2024-04-30 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3055:
--

 Summary: README: fix Github "hub" commands
 Key: NUTCH-3055
 URL: https://issues.apache.org/jira/browse/NUTCH-3055
 Project: Nutch
  Issue Type: Bug
  Components: documentation
Affects Versions: 1.20
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.21


The [README.md|https://github.com/apache/nutch/blob/master/README.md] contains 
[Github hub|https://hub.github.com/] commands, but with "git" as the command 
(executable) name, perhaps relying on an alias or some other magic. However, if 
hub isn't installed, these commands fail with {{git: 'pull-request' is not a git 
command. See 'git --help'.}} or similar.

We should use the command "hub" instead.





[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-30 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842291#comment-17842291
 ] 

Sebastian Nagel commented on NUTCH-3028:


+1 lgtm.

One question: if there is no parseData, the JEXL expression is not evaluated. 
Since WARC files may include only the raw HTML plus fetch/capture metadata, 
successfully parsing a document is not a requirement for archiving it in a WARC 
file. It might be useful to have the JEXL filtering also available for unparsed 
documents.

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> {color:#00}-expr 
> 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'{color}





[jira] [Commented] (NUTCH-3045) Upgrade from Java 11 to 17

2024-04-30 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842284#comment-17842284
 ] 

Sebastian Nagel commented on NUTCH-3045:


See also NUTCH-2987. Until HADOOP-17177 / HADOOP-18887 are done, we might be 
forced to maintain JDK 11 runtime compatibility, so that Nutch runs on recent 
Hadoop versions and distributions. I fully agree that Java 17 offers some nice 
syntax improvements, though. :)

> Upgrade from Java 11 to 17
> --
>
> Key: NUTCH-3045
> URL: https://issues.apache.org/jira/browse/NUTCH-3045
> Project: Nutch
>  Issue Type: Task
>  Components: build, ci/cd
>Reporter: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.21
>
>
> This parent issue will track and organize work pertaining to upgrading Nutch 
> to JDK 17.
> Premier support for Oracle JDK 11 ended 7 months ago (30 Sep 2023).





Re: [DISCUSS] Consolidating Nutch Continuous Integration

2024-04-28 Thread Sebastian Nagel

Hi Lewis,

> The Jenkins job used to be run nightly but
> no longer is.

It pulls nightly from git:
  https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/scmPollLog/
but a build is only run if there are new commits. The latest one:
  https://lists.apache.org/thread/ywtlmdmckhd21c6y9c77z01q17h42jww

Of course, we could add nightly builds on Github, in addition to the
builds when pull requests are opened.

> is there any preference on choosing one (Jenkins
> Vs GitHub Actions) over the other?

From my side: no. It may not harm to have both.

Best,
Sebastian

On 4/25/24 16:40, lewis john mcgibbney wrote:

Hi dev@,

We currently maintain a combination of Jenkins [0] and GitHub Actions [1] for 
CI.

For the longest time, we relied solely on Jenkins. This was really useful 
particularly when committers were pulling build artifacts from Jenkins nightly 
and relied on SVN trunk being stable. The Jenkins job used to be run nightly but 
no longer is. It is not clear exactly when nightly SNAPSHOT builds were turned off.


In 2020 we accepted a pull request [2] which established GitHub Actions and 
since then have gradually added small but important updates to the GitHub 
Actions workflow [3].


I can elaborate on the details of what each CI workflow does (it is not overly 
complex) but before I do that, is there any preference on choosing one (Jenkins 
Vs GitHub Actions) over the other?


Thanks

lewismc

[0] https://ci-builds.apache.org/job/Nutch/ 

[1] 
https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml 

[2] 
https://github.com/apache/nutch/commit/e33aaa14739c7c02f4121ac1d8d0e7860f329e06 

[3] 
https://github.com/apache/nutch/commits/master/.github/workflows/master-build.yml 


--
http://home.apache.org/~lewismc/ 
http://people.apache.org/keys/committer/lewismc 



[jira] [Created] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails

2024-04-25 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3044:
--

 Summary: Generator: NPE when extracting the host part of a URL 
fails
 Key: NUTCH-3044
 URL: https://issues.apache.org/jira/browse/NUTCH-3044
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.20
Reporter: Sebastian Nagel
 Fix For: 1.21


When extracting the host part of a URL fails, the Generator job fails because 
of an NPE in the SelectorReducer. This issue is reproducible if the CrawlDb 
contains a malformed URL, for example, a URL with an unsupported scheme 
(smb://).

{noformat}
Caused by: java.lang.NullPointerException
  at org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:439)
  at org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:300)
{noformat}
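
The failure mode can be illustrated with a small stand-alone sketch (a hypothetical helper, not Nutch's actual code): extracting the host from a URL with an unsupported scheme yields no host, so the result must be null-checked before use, e.g. in the reducer:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class HostExtract {
    // Returns the host part of a URL, or null if the URL cannot be parsed
    // (e.g. an unsupported scheme such as smb:// raises MalformedURLException).
    static String hostOrNull(String url) {
        try {
            return new URL(url).getHost();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOrNull("https://example.com/x")); // example.com
        // unsupported scheme: without a null check, code using the
        // result would fail with a NullPointerException
        System.out.println(hostOrNull("smb://nas/share"));
    }
}
```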





[jira] [Created] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-04-25 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3043:
--

 Summary: Generator: count URLs rejected by URL filters
 Key: NUTCH-3043
 URL: https://issues.apache.org/jira/browse/NUTCH-3043
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Affects Versions: 1.20
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.21


Generator already counts URLs rejected by the (re)fetch scheduler, by fetch 
interval or status. It should also count the number of URLs rejected by URL 
filters.

See also [Generator 
metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator].
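
A minimal stand-alone sketch of the proposed counting (hypothetical names, not the actual Generator code): whenever the configured URL filters reject a URL, a dedicated counter is incremented alongside the existing rejection counters:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Predicate;

// Simplified model: in Nutch the count would be a Hadoop job counter,
// next to the existing counters for scheduler/interval/status rejections.
public class FilterRejectCounter {
    static final AtomicLong rejectedByFilters = new AtomicLong();

    // Apply a URL filter; count the URL as rejected if the filter declines it.
    static boolean accept(String url, Predicate<String> urlFilter) {
        if (!urlFilter.test(url)) {
            rejectedByFilters.incrementAndGet(); // proposed new counter
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Predicate<String> httpsOnly = u -> u.startsWith("https://");
        List.of("https://example.com/", "smb://nas/share")
            .forEach(u -> accept(u, httpsOnly));
        System.out.println("URLs rejected by filters: " + rejectedByFilters.get());
    }
}
```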





[jira] [Created] (NUTCH-3040) Upgrade to Hadoop 3.4.0

2024-04-11 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3040:
--

 Summary: Upgrade to Hadoop 3.4.0
 Key: NUTCH-3040
 URL: https://issues.apache.org/jira/browse/NUTCH-3040
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.20
Reporter: Sebastian Nagel
 Fix For: 1.21


[Hadoop 3.4.0|https://hadoop.apache.org/release/3.4.0.html] has been released.

Many dependencies are upgraded, including commons-io 2.14.0 which would have 
saved us a lot of work in NUTCH-2959.





Re: [VOTE] Apache Nutch 1.20 Release

2024-04-11 Thread Sebastian Nagel

Hi Lewis,

here's my +1

 * signatures of release packages are valid
 * build from the source package successful, unit tests pass
 * tested few Nutch tools in the binary package (local mode)
 * run a sample crawl and tested many Nutch tools on a single-node cluster
   running Hadoop 3.4.0, see
   https://github.com/sebastian-nagel/nutch-test-single-node-cluster/

One note about the CHANGES.md: it's now a mixture of HTML and plain text.
It does not use the potential of markdown, e.g. sections / headlines for
the releases to make the change log navigable via a table of contents.
The embedded HTML makes it less readable if viewed in a text editor.
The rendering on Github [5] is acceptable with only minor glitches,
mostly the placement of multiple lines in a single paragraph:
  https://github.com/apache/nutch/blob/branch-1.20/CHANGES.md
We also have a change log on Jira:
  https://s.apache.org/ovjf3
That's why I wouldn't call the CHANGES.md a "blocker". We should
update the formatting after the release to make it easily readable
again as plain text and improve the document structure using the
markdown markup.

~Sebastian

On 4/9/24 23:28, lewis john mcgibbney wrote:

Hi Folks,

A first candidate for the Nutch 1.20 release is available at [0] where 
accompanying SHA512 and ASC signatures can also be found.

Information on verifying releases can be found at [1].

The release candidate comprises a .zip and tar.gz archive of the sources at [2] 
and complementary binary distributions. In addition, a staged maven repository 
is available at [3].


The Nutch 1.20 release report is available at [4].

Please vote on releasing this package as Apache Nutch 1.20. The vote is open for 
at least the next 72 hours and passes if a majority of at least three +1 Nutch 
PMC votes are cast.


[ ] +1 Release this package as Apache Nutch X.XX.

[ ] -1 Do not release this package because…

Cheers,
lewismc
P.S. Here is my +1.

[0] https://dist.apache.org/repos/dist/dev/nutch/1.20
[1] http://nutch.apache.org/downloads.html#verify
[2] https://github.com/apache/nutch/tree/release-1.20
[3] https://repository.apache.org/content/repositories/orgapachenutch-1021/

[4] https://s.apache.org/ovjf3

--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


[jira] [Assigned] (NUTCH-3039) Failure to handle ftp:// URLs

2024-04-11 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-3039:
--

Assignee: Sebastian Nagel

> Failure to handle ftp:// URLs
> -
>
> Key: NUTCH-3039
> URL: https://issues.apache.org/jira/browse/NUTCH-3039
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.19
>    Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> Nutch fails to handle ftp:// URLs:
> - URLNormalizerBasic returns the empty string because creating the URL 
> instance fails with a MalformedURLException:
>   {noformat}
> echo "ftp://ftp.example.com/path/file.txt" \
>   | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat}
> - fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due 
> to a MalformedURLException:
>   {noformat}
> bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \
>"ftp://ftp.example.com/path/file.txt"
> ...
> Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: 
> java.net.MalformedURLException
> at 
> org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)
> ...{noformat}
> The issue is caused by NUTCH-2429:
> - we do not provide a dedicated URL stream handler for ftp URLs
> - but also do not pass ftp:// URLs to the standard JVM handler





[jira] [Created] (NUTCH-3039) Failure to handle ftp:// URLs

2024-04-11 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3039:
--

 Summary: Failure to handle ftp:// URLs
 Key: NUTCH-3039
 URL: https://issues.apache.org/jira/browse/NUTCH-3039
 Project: Nutch
  Issue Type: Bug
  Components: plugin, protocol
Affects Versions: 1.19
Reporter: Sebastian Nagel
 Fix For: 1.21


Nutch fails to handle ftp:// URLs:
- URLNormalizerBasic returns the empty string because creating the URL instance 
fails with a MalformedURLException:
  {noformat}
echo "ftp://ftp.example.com/path/file.txt" \
  | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat}
- fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due to 
a MalformedURLException:
  {noformat}
bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \
   "ftp://ftp.example.com/path/file.txt"
...
Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: 
java.net.MalformedURLException
at 
org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)
...{noformat}


The issue is caused by NUTCH-2429:
- we do not provide a dedicated URL stream handler for ftp URLs
- but also do not pass ftp:// URLs to the standard JVM handler
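
One possible direction (a sketch under stated assumptions, not necessarily the actual fix): a {{URLStreamHandlerFactory}} may return null for schemes it does not implement itself, which makes the JVM fall back to its built-in handlers (http, https, ftp, ...):

```java
import java.net.URL;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

// Hypothetical sketch: a factory that would return a custom handler only
// for schemes it implements itself, and returns null for everything else
// so that the JVM's default handlers (including ftp) are used as a fallback.
public class FallbackFactory implements URLStreamHandlerFactory {
    @Override
    public URLStreamHandler createURLStreamHandler(String protocol) {
        // returning null tells the JVM to fall back to its default handler
        return null;
    }

    public static void main(String[] args) throws Exception {
        URL.setURLStreamHandlerFactory(new FallbackFactory());
        // ftp:// still resolves because the factory declined the scheme
        // and the JVM's built-in ftp handler applies
        URL u = new URL("ftp://ftp.example.com/path/file.txt");
        System.out.println(u.getHost());
    }
}
```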





[jira] [Resolved] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode

2024-04-06 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2937.

Resolution: Fixed

Fixed in NUTCH-2959 by using the shaded Tika package. Thanks, [~tallison]!

> parse-tika: review dependency exclusions and avoid dependency conflicts in 
> distributed mode
> ---
>
> Key: NUTCH-2937
> URL: https://issues.apache.org/jira/browse/NUTCH-2937
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, plugin
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> While testing NUTCH-2919 I've seen the following error caused by a 
> conflicting dependency to commons-io:
> - 2.11.0 Nutch core
> - 2.11.0 parse-tika (excluded to avoid duplicated dependencies)
> - 2.5 provided by Hadoop
> This causes errors parsing some office and other documents (but not all), for 
> example:
> {noformat}
> 2022-01-15 01:36:31,365 WARN [FetcherThread] 
> org.apache.nutch.parse.ParseUtil: Error parsing 
> http://kurskrun.ru/privacypolicy with org.apache.nutch.parse.tika.TikaParser
> java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 
> 'org.apache.commons.io.input.CloseShieldInputStream 
> org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
> at 
> java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
> at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
> at 
> org.apache.nutch.fetcher.FetcherThread.output(FetcherThread.java:715)
> at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:431)
> Caused by: java.lang.NoSuchMethodError: 
> 'org.apache.commons.io.input.CloseShieldInputStream 
> org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120)
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
> at 
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151)
> at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> {noformat}




