[jira] [Updated] (NUTCH-3056) Injector to support resolving seed URLs

2024-05-16 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3056:
-
Description: 
We have a case where clients submit huge uncurated seed files: the host may no
longer exist or may redirect through several hops to somewhere else, the
protocol may be incorrect, etc.

The large crawl itself is not supposed to venture much beyond the seed list,
except for regex exceptions listed in db-ignore-external-exemptions. To control
the size of the crawl, it is also not allowed to jump to other domains/hosts.
This means externally redirecting seeds will not be crawled.

This ticket will add support for a multi-threaded
host/domain/protocol/redirect resolver to the injector. Seeds that do not lead
to a URL returning HTTP 200 will be discarded. Enabling filtering and
normalization is highly recommended for handling the redirects.
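
A minimal sketch of the resolving idea, not the code from the patch: seeds are
resolved concurrently on a fixed-size thread pool and every seed that does not
resolve is dropped. The class name SeedResolverSketch, the method resolveAll
and the resolver function are hypothetical; a real resolver would issue HTTP
requests and follow redirects, which is stubbed out here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper for illustration only; it is not part of Nutch.
final class SeedResolverSketch {

  /**
   * Applies the resolver to every seed on a fixed-size thread pool. The
   * resolver returns the final URL, or an empty Optional when the seed should
   * be discarded (for example because it did not end in an HTTP 200).
   */
  static List<String> resolveAll(List<String> seeds,
      Function<String, Optional<String>> resolver, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Optional<String>>> futures = new ArrayList<>();
      for (String seed : seeds) {
        Callable<Optional<String>> task = () -> resolver.apply(seed);
        futures.add(pool.submit(task));
      }
      List<String> kept = new ArrayList<>();
      // Futures are read in submission order, so the output order is stable.
      for (Future<Optional<String>> future : futures) {
        try {
          future.get().ifPresent(kept::add);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException(e);
        } catch (ExecutionException e) {
          throw new IllegalStateException(e.getCause());
        }
      }
      return kept;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    // Stub resolver: pretends hosts containing "dead" no longer exist and
    // that every other seed redirects from http to https.
    Function<String, Optional<String>> fakeResolver = seed ->
        seed.contains("dead") ? Optional.<String>empty()
            : Optional.of(seed.replace("http://", "https://"));
    List<String> seeds = Arrays.asList("http://a.example/",
        "http://dead.example/", "http://b.example/");
    System.out.println(resolveAll(seeds, fakeResolver, 4));
  }
}
```

Passing the resolver in as a function keeps the threading logic testable
without network access.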

If you have a seed file with 10k+ or millions of records, you are strongly
advised to split the input file into chunks so that multiple mappers can get to
work. Passing a few million records through one mapper without resolving is no
problem, but resolving millions with one mapper, even if threaded, will take
many hours.
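
The chunking advice can be sketched as follows. This is an illustration under
assumptions, not part of the patch: SeedChunker and chunk are hypothetical
names, and the sketch assumes each chunk is then written to its own file in the
seed directory so that separate mappers can pick the files up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; it is not part of Nutch.
final class SeedChunker {

  /** Splits the seeds into consecutive chunks of at most chunkSize entries. */
  static List<List<String>> chunk(List<String> seeds, int chunkSize) {
    if (chunkSize <= 0) {
      throw new IllegalArgumentException("chunkSize must be positive");
    }
    List<List<String>> chunks = new ArrayList<>();
    for (int start = 0; start < seeds.size(); start += chunkSize) {
      int end = Math.min(start + chunkSize, seeds.size());
      // Copy the view so each chunk is independent of the input list.
      chunks.add(new ArrayList<>(seeds.subList(start, end)));
    }
    return chunks;
  }

  public static void main(String[] args) {
    List<String> seeds = Arrays.asList("http://a.example/",
        "http://b.example/", "http://c.example/");
    for (List<String> part : chunk(seeds, 2)) {
      System.out.println(part);
    }
  }
}
```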

  was:
We have a case where clients submit huge uncurated seed files: the host may no
longer exist or may redirect through several hops to somewhere else, the
protocol may be incorrect, etc.

The large crawl itself is not supposed to venture much beyond the seed list,
except for regex exceptions listed in db-ignore-external-exemptions. To control
the size of the crawl, it is also not allowed to jump to other domains/hosts.
This means externally redirecting seeds will not be crawled.

This ticket will add support for a multi-threaded
host/domain/protocol/redirect resolver to the injector.

If you have a seed file with 10k+ or millions of records, you are strongly
advised to split the input file into chunks so that multiple mappers can get to
work. Passing a few million records through one mapper without resolving is no
problem, but resolving millions with one mapper, even if threaded, will take
many hours.


> Injector to support resolving seed URLs
> ---
>
> Key: NUTCH-3056
> URL: https://issues.apache.org/jira/browse/NUTCH-3056
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.21
>
>
> We have a case where clients submit huge uncurated seed files: the host may
> no longer exist or may redirect through several hops to somewhere else, the
> protocol may be incorrect, etc.
> The large crawl itself is not supposed to venture much beyond the seed list,
> except for regex exceptions listed in db-ignore-external-exemptions. To
> control the size of the crawl, it is also not allowed to jump to other
> domains/hosts. This means externally redirecting seeds will not be crawled.
> This ticket will add support for a multi-threaded
> host/domain/protocol/redirect resolver to the injector. Seeds that do not
> lead to a URL returning HTTP 200 will be discarded. Enabling filtering and
> normalization is highly recommended for handling the redirects.
> If you have a seed file with 10k+ or millions of records, you are strongly
> advised to split the input file into chunks so that multiple mappers can get
> to work. Passing a few million records through one mapper without resolving
> is no problem, but resolving millions with one mapper, even if threaded,
> will take many hours.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-3056) Injector to support resolving seed URLs

2024-05-16 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3056:
-
Description: 
We have a case where clients submit huge uncurated seed files: the host may no
longer exist or may redirect through several hops to somewhere else, the
protocol may be incorrect, etc.

The large crawl itself is not supposed to venture much beyond the seed list,
except for regex exceptions listed in db-ignore-external-exemptions. To control
the size of the crawl, it is also not allowed to jump to other domains/hosts.
This means externally redirecting seeds will not be crawled.

This ticket will add support for a multi-threaded
host/domain/protocol/redirect resolver to the injector.

If you have a seed file with 10k+ or millions of records, you are strongly
advised to split the input file into chunks so that multiple mappers can get to
work. Passing a few million records through one mapper without resolving is no
problem, but resolving millions with one mapper, even if threaded, will take
many hours.

  was:
We have a case where clients submit huge uncurated seed files: the host may no
longer exist or may redirect through several hops to somewhere else, the
protocol may be incorrect, etc.

The large crawl itself is not supposed to venture much beyond the seed list,
except for regex exceptions listed in db-ignore-external-exemptions. To control
the size of the crawl, it is also not allowed to jump to other domains/hosts.
This means externally redirecting seeds will not be crawled.

This ticket will add support for a multi-threaded
host/domain/protocol/redirect resolver to the injector.


> Injector to support resolving seed URLs
> ---
>
> Key: NUTCH-3056
> URL: https://issues.apache.org/jira/browse/NUTCH-3056
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.21
>
>
> We have a case where clients submit huge uncurated seed files: the host may
> no longer exist or may redirect through several hops to somewhere else, the
> protocol may be incorrect, etc.
> The large crawl itself is not supposed to venture much beyond the seed list,
> except for regex exceptions listed in db-ignore-external-exemptions. To
> control the size of the crawl, it is also not allowed to jump to other
> domains/hosts. This means externally redirecting seeds will not be crawled.
> This ticket will add support for a multi-threaded
> host/domain/protocol/redirect resolver to the injector.
> If you have a seed file with 10k+ or millions of records, you are strongly
> advised to split the input file into chunks so that multiple mappers can get
> to work. Passing a few million records through one mapper without resolving
> is no problem, but resolving millions with one mapper, even if threaded,
> will take many hours.





[jira] [Created] (NUTCH-3056) Injector to support resolving seed URLs

2024-05-16 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-3056:


 Summary: Injector to support resolving seed URLs
 Key: NUTCH-3056
 URL: https://issues.apache.org/jira/browse/NUTCH-3056
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.21


We have a case where clients submit huge uncurated seed files: the host may no
longer exist or may redirect through several hops to somewhere else, the
protocol may be incorrect, etc.

The large crawl itself is not supposed to venture much beyond the seed list,
except for regex exceptions listed in db-ignore-external-exemptions. To control
the size of the crawl, it is also not allowed to jump to other domains/hosts.
This means externally redirecting seeds will not be crawled.

This ticket will add support for a multi-threaded
host/domain/protocol/redirect resolver to the injector.





[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-30 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842384#comment-17842384
 ] 

Markus Jelsma commented on NUTCH-3028:
--

Ok, the Content object is now also available in the evaluation. I added an 
example of it to the description above.

 

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3028-1.patch, NUTCH-3028-2.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'
> or
> -expr 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")'





[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-30 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3028:
-
Attachment: NUTCH-3028-2.patch

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3028-1.patch, NUTCH-3028-2.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'
> or
> -expr 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")'





[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-30 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3028:
-
Description: 
Filtering segment data to WARC is now possible using JEXL expressions. In the 
next example, all records with SOME_KEY=SOME_VALUE in their parseData metadata 
are exported to WARC.

-expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'

or

-expr 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")'

  was:
Filtering segment data to WARC is now possible using JEXL expressions. In the 
next example, all records with SOME_KEY=SOME_VALUE in their parseData metadata 
are exported to WARC.

-expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'


> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'
> or
> -expr 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")'





[jira] [Commented] (NUTCH-3039) Failure to handle ftp:// URLs

2024-04-11 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836133#comment-17836133
 ] 

Markus Jelsma commented on NUTCH-3039:
--

Thanks for spotting that!

> Failure to handle ftp:// URLs
> -
>
> Key: NUTCH-3039
> URL: https://issues.apache.org/jira/browse/NUTCH-3039
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> Nutch fails to handle ftp:// URLs:
> - URLNormalizerBasic returns the empty string because creating the URL 
> instance fails with a MalformedURLException:
>   {noformat}
> echo "ftp://ftp.example.com/path/file.txt" \
>   | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat}
> - fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due 
> to a MalformedURLException:
>   {noformat}
> bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \
>    "ftp://ftp.example.com/path/file.txt"
> ...
> Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: 
> java.net.MalformedURLException
> at 
> org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)
> ...{noformat}
> The issue is caused by NUTCH-2429:
> - we do not provide a dedicated URL stream handler for ftp URLs
> - but also do not pass ftp:// URLs to the standard JVM handler
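
For comparison, a small sketch of what the standard JVM handler does with such
a URL. It assumes a stock JVM with no custom URLStreamHandlerFactory installed,
so it does not reproduce the Nutch code path described above; FtpUrlCheck and
isParsable are hypothetical names.

```java
import java.net.MalformedURLException;
import java.net.URI;

// Hypothetical helper for illustration only; it is not part of Nutch.
final class FtpUrlCheck {

  /**
   * Returns true if the JVM can build a java.net.URL for the given string,
   * i.e. a URL stream handler is available for its protocol.
   */
  static boolean isParsable(String spec) {
    try {
      URI.create(spec).toURL();
      return true;
    } catch (MalformedURLException e) {
      // Thrown when no handler is found for the protocol.
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(isParsable("ftp://ftp.example.com/path/file.txt"));
    System.out.println(isParsable("nosuchscheme://example.com/"));
  }
}
```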





[jira] [Commented] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-14 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827048#comment-17827048
 ] 

Markus Jelsma commented on NUTCH-3029:
--

comment describing throws is also required these days.

   a8ec17ca8..98902236d  master -> master

> Host specific max. and min. intervals in adaptive scheduler
> ---
>
> Key: NUTCH-3029
> URL: https://issues.apache.org/jira/browse/NUTCH-3029
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.19, 1.20
>Reporter: Martin Djukanovic
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.20
>
> Attachments: adaptive-host-specific-intervals.txt.template, 
> new_adaptive_fetch_schedule-1.patch
>
>
> This patch implements custom max. and min. refetching intervals for specific 
> hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt 
> configuration file (template also attached).





[jira] [Commented] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-13 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826823#comment-17826823
 ] 

Markus Jelsma commented on NUTCH-3029:
--

throws was missing too

   84cda2abd..a8ec17ca8  master -> master

> Host specific max. and min. intervals in adaptive scheduler
> ---
>
> Key: NUTCH-3029
> URL: https://issues.apache.org/jira/browse/NUTCH-3029
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.19, 1.20
>Reporter: Martin Djukanovic
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: adaptive-host-specific-intervals.txt.template, 
> new_adaptive_fetch_schedule-1.patch
>
>
> This patch implements custom max. and min. refetching intervals for specific 
> hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt 
> configuration file (template also attached).





[jira] [Commented] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-13 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826783#comment-17826783
 ] 

Markus Jelsma commented on NUTCH-3029:
--

Thanks Lewis!

   5ba50c0c6..84cda2abd  master -> master



 

> Host specific max. and min. intervals in adaptive scheduler
> ---
>
> Key: NUTCH-3029
> URL: https://issues.apache.org/jira/browse/NUTCH-3029
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.19, 1.20
>Reporter: Martin Djukanovic
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: adaptive-host-specific-intervals.txt.template, 
> new_adaptive_fetch_schedule-1.patch
>
>
> This patch implements custom max. and min. refetching intervals for specific 
> hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt 
> configuration file (template also attached).





[jira] [Commented] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-13 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826759#comment-17826759
 ] 

Markus Jelsma commented on NUTCH-3029:
--

   4f62dec0f..5ba50c0c6  master -> master



actual change was missing from the commit for some reason

> Host specific max. and min. intervals in adaptive scheduler
> ---
>
> Key: NUTCH-3029
> URL: https://issues.apache.org/jira/browse/NUTCH-3029
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.19, 1.20
>Reporter: Martin Djukanovic
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: adaptive-host-specific-intervals.txt.template, 
> new_adaptive_fetch_schedule-1.patch
>
>
> This patch implements custom max. and min. refetching intervals for specific 
> hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt 
> configuration file (template also attached).





[jira] [Commented] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-13 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826760#comment-17826760
 ] 

Markus Jelsma commented on NUTCH-3033:
--

Ah, the new ivy works like a charm!

Thanks!

> Upgrade Ivy to v2.5.2
> -
>
> Key: NUTCH-3033
> URL: https://issues.apache.org/jira/browse/NUTCH-3033
> Project: Nutch
>  Issue Type: Task
>  Components: ivy
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> Ivy v2.5.2 was released August 20th 2023. Let’s upgrade.
> [https://ant.apache.org/ivy/history/2.5.2/release-notes.html]





[jira] [Resolved] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-13 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-3029.
--
Resolution: Fixed

Thanks Martin!

   551c50b1c..4642c30c2  master -> master

> Host specific max. and min. intervals in adaptive scheduler
> ---
>
> Key: NUTCH-3029
> URL: https://issues.apache.org/jira/browse/NUTCH-3029
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.19, 1.20
>Reporter: Martin Djukanovic
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: adaptive-host-specific-intervals.txt.template, 
> new_adaptive_fetch_schedule-1.patch
>
>
> This patch implements custom max. and min. refetching intervals for specific 
> hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt 
> configuration file (template also attached).





[jira] [Resolved] (NUTCH-3030) Use system default cipher suites instead of hard-coded set

2024-03-13 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-3030.
--
Resolution: Fixed

42b55f6a9..551c50b1c  master -> master

 

Thanks Martin!

 

> Use system default cipher suites instead of hard-coded set
> --
>
> Key: NUTCH-3030
> URL: https://issues.apache.org/jira/browse/NUTCH-3030
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Martin Djukanovic
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-3030.patch, default_ciphers_and_protocols-2.patch
>
>
> If http.tls.supported.cipher.suites is not set in the configuration, it 
> defaults to a hard-coded list which is not exhaustive enough. I have 
> encountered websites that exclusively use ciphers which are not included, so 
> they could not be handled by protocol-http.
> I changed this list to the system default -- SSLSocketFactory's 
> .getDefaultCipherSuites() to be precise. One could also use 
> .getSupportedCipherSuites() here, I suppose.
> The original list should be moved to nutch-default.xml or omitted altogether. 
> The protocol list is still hard-coded, but it is now also added to 
> nutch-default.xml (so it can be easily changed manually if needed).
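
The two JDK calls mentioned in the description can be compared with a short
sketch. It only lists what the running JVM reports and is not the code from the
attached patch; CipherSuiteListing and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.net.ssl.SSLSocketFactory;

// Hypothetical helper for illustration only; it is not part of Nutch.
final class CipherSuiteListing {

  private static SSLSocketFactory factory() {
    return (SSLSocketFactory) SSLSocketFactory.getDefault();
  }

  /** Cipher suites enabled by default on this JVM. */
  static Set<String> defaultSuites() {
    return new LinkedHashSet<>(Arrays.asList(factory().getDefaultCipherSuites()));
  }

  /** Cipher suites that could be enabled on this JVM. */
  static Set<String> supportedSuites() {
    return new LinkedHashSet<>(Arrays.asList(factory().getSupportedCipherSuites()));
  }

  public static void main(String[] args) {
    System.out.println("default:   " + defaultSuites().size());
    System.out.println("supported: " + supportedSuites().size());
    // Suites that are supported but not enabled by default.
    Set<String> extra = supportedSuites();
    extra.removeAll(defaultSuites());
    System.out.println("supported only: " + extra);
  }
}
```

The supported list may include suites the JVM disables by default, which is
worth keeping in mind before switching to getSupportedCipherSuites().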





[jira] [Updated] (NUTCH-3030) Use system default cipher suites instead of hard-coded set

2024-03-13 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3030:
-
Summary: Use system default cipher suites instead of hard-coded set  (was: 
Update default TLS cipher suites for http(s) protocol)

> Use system default cipher suites instead of hard-coded set
> --
>
> Key: NUTCH-3030
> URL: https://issues.apache.org/jira/browse/NUTCH-3030
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Martin Djukanovic
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-3030.patch, default_ciphers_and_protocols-2.patch
>
>
> If http.tls.supported.cipher.suites is not set in the configuration, it 
> defaults to a hard-coded list which is not exhaustive enough. I have 
> encountered websites that exclusively use ciphers which are not included, so 
> they could not be handled by protocol-http.
> I changed this list to the system default -- SSLSocketFactory's 
> .getDefaultCipherSuites() to be precise. One could also use 
> .getSupportedCipherSuites() here, I suppose.
> The original list should be moved to nutch-default.xml or omitted altogether. 
> The protocol list is still hard-coded, but it is now also added to 
> nutch-default.xml (so it can be easily changed manually if needed).





[jira] [Commented] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-12 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825863#comment-17825863
 ] 

Markus Jelsma commented on NUTCH-3032:
--

No idea what git fork is supposed to do, maybe it should be a git branch
instead. I am not a skilled Git user, but you can always attach a patch to
this ticket.

> Indexing plugin as an adapter for end user's own POJO instances
> ---
>
> Key: NUTCH-3032
> URL: https://issues.apache.org/jira/browse/NUTCH-3032
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Joe Gilvary
>Priority: Major
>  Labels: indexing
>
> It could be helpful to let end users manipulate information at indexing time 
> with their own code without the need for writing their own indexing plugin. I 
> mentioned this on the dev mailing list 
> (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some 
> description of my work in progress.
> One potential use is to address some of the same concerns that NUTCH-585 
> discusses regarding an alternative approach to picking and choosing which 
> content to index, but this approach would allow making index time decisions, 
> rather than setting the configuration for all content at the start of the 
> indexing run.
>  





[jira] [Resolved] (NUTCH-3031) ProtocolFactory host mapper to support domains

2024-03-12 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-3031.
--
Resolution: Fixed

   83acd501e..c390dfc8b  master -> master

> ProtocolFactory host mapper to support domains
> --
>
> Key: NUTCH-3031
> URL: https://issues.apache.org/jira/browse/NUTCH-3031
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-3031.patch
>
>
> Currently ProtocolFactory supports different protocol plugins based on the 
> host configured for it. This patch will add support for listing domains as 
> well so you don't have to list numerous subdomains for one larger domain.





[jira] [Updated] (NUTCH-3031) ProtocolFactory host mapper to support domains

2024-03-05 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3031:
-
Attachment: NUTCH-3031.patch

> ProtocolFactory host mapper to support domains
> --
>
> Key: NUTCH-3031
> URL: https://issues.apache.org/jira/browse/NUTCH-3031
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-3031.patch
>
>
> Currently ProtocolFactory supports different protocol plugins based on the 
> host configured for it. This patch will add support for listing domains as 
> well so you don't have to list numerous subdomains for one larger domain.





[jira] [Created] (NUTCH-3031) ProtocolFactory host mapper to support domains

2024-03-05 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-3031:


 Summary: ProtocolFactory host mapper to support domains
 Key: NUTCH-3031
 URL: https://issues.apache.org/jira/browse/NUTCH-3031
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.20


Currently ProtocolFactory supports different protocol plugins based on the host 
configured for it. This patch will add support for listing domains as well so 
you don't have to list numerous subdomains for one larger domain.
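
The lookup order this implies can be sketched as follows. This is a simplified
illustration, not the code from the patch: HostDomainLookup and pluginFor are
hypothetical names, and the sketch walks parent labels of the host name instead
of using a real domain-name lookup.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for illustration only; it is not part of Nutch.
final class HostDomainLookup {

  /**
   * Returns the plugin mapped to the exact host if there is one, otherwise the
   * plugin mapped to the closest listed parent domain, otherwise empty.
   */
  static Optional<String> pluginFor(String host, Map<String, String> hostMap,
      Map<String, String> domainMap) {
    String exact = hostMap.get(host);
    if (exact != null) {
      return Optional.of(exact);
    }
    String candidate = host;
    while (true) {
      String plugin = domainMap.get(candidate);
      if (plugin != null) {
        return Optional.of(plugin);
      }
      int dot = candidate.indexOf('.');
      if (dot < 0) {
        return Optional.empty();
      }
      // Strip the leftmost label and try the parent domain next.
      candidate = candidate.substring(dot + 1);
    }
  }

  public static void main(String[] args) {
    Map<String, String> hosts =
        Collections.singletonMap("api.example.org", "protocol-okhttp");
    Map<String, String> domains =
        Collections.singletonMap("example.org", "protocol-http");
    System.out.println(pluginFor("api.example.org", hosts, domains));
    System.out.println(pluginFor("www.shop.example.org", hosts, domains));
    System.out.println(pluginFor("other.test", hosts, domains));
  }
}
```

Checking the host map first keeps a host-specific entry from being shadowed by
a broader domain entry.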





[jira] [Commented] (NUTCH-3030) Update default TLS cipher suites for http(s) protocol

2024-02-19 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818531#comment-17818531
 ] 

Markus Jelsma commented on NUTCH-3030:
--

For some reason the attached patch did not apply cleanly (error on line 96), so
I added a new patch that applies without complaining.

> Update default TLS cipher suites for http(s) protocol
> -
>
> Key: NUTCH-3030
> URL: https://issues.apache.org/jira/browse/NUTCH-3030
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Martin Djukanovic
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-3030.patch, default_ciphers_and_protocols-2.patch
>
>
> If http.tls.supported.cipher.suites is not set in the configuration, it 
> defaults to a hard-coded list which is not exhaustive enough. I have 
> encountered websites that exclusively use ciphers which are not included, so 
> they could not be handled by protocol-http.
> I changed this list to the system default -- SSLSocketFactory's 
> .getDefaultCipherSuites() to be precise. One could also use 
> .getSupportedCipherSuites() here, I suppose.
> The original list should be moved to nutch-default.xml or omitted altogether. 
> The protocol list is still hard-coded, but it is now also added to 
> nutch-default.xml (so it can be easily changed manually if needed).





[jira] [Updated] (NUTCH-3030) Update default TLS cipher suites for http(s) protocol

2024-02-19 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3030:
-
Attachment: NUTCH-3030.patch

> Update default TLS cipher suites for http(s) protocol
> -
>
> Key: NUTCH-3030
> URL: https://issues.apache.org/jira/browse/NUTCH-3030
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Martin Djukanovic
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-3030.patch, default_ciphers_and_protocols-2.patch
>
>
> If http.tls.supported.cipher.suites is not set in the configuration, it 
> defaults to a hard-coded list which is not exhaustive enough. I have 
> encountered websites that exclusively use ciphers which are not included, so 
> they could not be handled by protocol-http.
> I changed this list to the system default -- SSLSocketFactory's 
> .getDefaultCipherSuites() to be precise. One could also use 
> .getSupportedCipherSuites() here, I suppose.
> The original list should be moved to nutch-default.xml or omitted altogether. 
> The protocol list is still hard-coded, but it is now also added to 
> nutch-default.xml (so it can be easily changed manually if needed).





[jira] [Assigned] (NUTCH-3030) Update default TLS cipher suites for http(s) protocol

2024-02-19 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma reassigned NUTCH-3030:


Assignee: Markus Jelsma

> Update default TLS cipher suites for http(s) protocol
> -
>
> Key: NUTCH-3030
> URL: https://issues.apache.org/jira/browse/NUTCH-3030
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Martin Djukanovic
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: default_ciphers_and_protocols-2.patch
>
>
> If http.tls.supported.cipher.suites is not set in the configuration, it 
> defaults to a hard-coded list which is not exhaustive enough. I have 
> encountered websites that exclusively use ciphers which are not included, so 
> they could not be handled by protocol-http.
> I changed this list to the system default -- SSLSocketFactory's 
> .getDefaultCipherSuites() to be precise. One could also use 
> .getSupportedCipherSuites() here, I suppose.
> The original list should be moved to nutch-default.xml or omitted altogether. 
> The protocol list is still hard-coded, but it is now also added to 
> nutch-default.xml (so it can be easily changed manually if needed).





[jira] [Assigned] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-02-19 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma reassigned NUTCH-3029:


Assignee: Markus Jelsma

> Host specific max. and min. intervals in adaptive scheduler
> ---
>
> Key: NUTCH-3029
> URL: https://issues.apache.org/jira/browse/NUTCH-3029
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.19, 1.20
>Reporter: Martin Djukanovic
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: adaptive-host-specific-intervals.txt.template, 
> new_adaptive_fetch_schedule-1.patch
>
>
> This patch implements custom max. and min. refetching intervals for specific 
> hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt 
> configuration file (template also attached).





[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-07 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815345#comment-17815345
 ] 

Markus Jelsma commented on NUTCH-3028:
--

New patch: fixes an exception that was raised when the expression was not set.

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'
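For readers unfamiliar with JEXL, the expression above corresponds roughly to the plain-Java predicate sketched below. This is an illustration only: the parse metadata is modelled as a Map<String, String>, and the names ParseMetaFilter and matches are hypothetical and not part of the patch. Because the expression calls equals() on the result of get("SOME_KEY"), records lacking the key may raise an error rather than simply being skipped, depending on how the JEXL engine is configured; putting the constant first avoids that.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the filter; Nutch's own metadata class is not used here.
class ParseMetaFilter {

  /** Returns true only if the metadata contains the key with exactly this value. */
  static boolean matches(Map<String, String> parseMeta, String key, String value) {
    // Constant-first comparison: a missing key yields false instead of a
    // NullPointerException.
    return value.equals(parseMeta.get(key));
  }

  public static void main(String[] args) {
    Map<String, String> meta = new HashMap<>();
    meta.put("SOME_KEY", "SOME_VALUE");
    System.out.println(matches(meta, "SOME_KEY", "SOME_VALUE"));
    System.out.println(matches(new HashMap<>(), "SOME_KEY", "SOME_VALUE"));
  }
}
```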





[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-07 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3028:
-
Attachment: NUTCH-3028-1.patch

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'





[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-06 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17814731#comment-17814731
 ] 

Markus Jelsma commented on NUTCH-3028:
--

Any objections to this one before I get it in?

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'





[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-01 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3028:
-
Description: 
Filtering segment data to WARC is now possible using JEXL expressions. In the 
next example, all records with SOME_KEY=SOME_VALUE in their parseData metadata 
are exported to WARC.

-expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'





[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-01 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3028:
-
Attachment: NUTCH-3027.patch

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-3028.patch
>
>






[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-01 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3028:
-
Attachment: (was: NUTCH-3027.patch)

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-3028.patch
>
>






[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-01 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3028:
-
Attachment: NUTCH-3028.patch

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-3028.patch
>
>






[jira] [Created] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-01 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-3028:


 Summary: WARCExported to support filtering by JEXL
 Key: NUTCH-3028
 URL: https://issues.apache.org/jira/browse/NUTCH-3028
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma








[jira] [Work started] (NUTCH-3027) Trivial resource leak patch in DomainSuffixes.java

2024-01-19 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3027 started by Markus Jelsma.

> Trivial resource leak patch in DomainSuffixes.java
> --
>
> Key: NUTCH-3027
> URL: https://issues.apache.org/jira/browse/NUTCH-3027
> Project: Nutch
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.20
>Reporter: Sascha Kehrli
>Assignee: Markus Jelsma
>Priority: Trivial
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Found a trivial resource leak in .../util/DomainSuffixes.java, where an 
> InputStream is not closed:
> {code:java}
> InputStream input =
>     this.getClass().getClassLoader().getResourceAsStream(file);
> try {
>   new DomainSuffixesReader().read(this, input);
> } catch (Exception ex) {
>   LOG.warn(StringUtils.stringifyException(ex));
> }
> {code}
>
> instead of:
> {code:java}
> try (InputStream input =
>     this.getClass().getClassLoader().getResourceAsStream(file)) {
>   new DomainSuffixesReader().read(this, input);
> } catch (Exception ex) {
>   LOG.warn(StringUtils.stringifyException(ex));
> }
> {code}
> Where the InputStream is automatically closed.
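The difference can be checked in isolation. The following self-contained sketch, which is not Nutch code, uses a hypothetical TrackingStream class to record whether close() was called, and shows that try-with-resources closes the stream even when reading fails.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class TryWithResourcesDemo {

  /** Hypothetical stream that records whether close() was called. */
  static class TrackingStream extends ByteArrayInputStream {
    boolean closed = false;

    TrackingStream(byte[] data) {
      super(data);
    }

    @Override
    public void close() throws IOException {
      closed = true;
      super.close();
    }
  }

  /** Reads from a stream, simulates a failure, and returns the stream for inspection. */
  static TrackingStream readAndFail() {
    TrackingStream tracked = new TrackingStream(new byte[] {1, 2, 3});
    try (InputStream input = tracked) {
      input.read();
      throw new IOException("simulated read failure");
    } catch (IOException ex) {
      // The resource has already been closed by the time this block runs.
    }
    return tracked;
  }

  public static void main(String[] args) {
    System.out.println("closed after failure: " + readAndFail().closed);
  }
}
```

A null resource is also tolerated by try-with-resources: if the resource expression evaluates to null, no close() call is attempted.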





[jira] [Commented] (NUTCH-3027) Trivial resource leak patch in DomainSuffixes.java

2024-01-19 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808614#comment-17808614
 ] 

Markus Jelsma commented on NUTCH-3027:
--

Thanks Sascha Kehrli!

Committed 85fea6e46..6b0455454  master -> master

> Trivial resource leak patch in DomainSuffixes.java
> --
>
> Key: NUTCH-3027
> URL: https://issues.apache.org/jira/browse/NUTCH-3027
> Project: Nutch
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.20
>Reporter: Sascha Kehrli
>Assignee: Markus Jelsma
>Priority: Trivial
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Found a trivial resource leak in .../util/DomainSuffixes.java, where an 
> InputStream is not closed:
> {code:java}
> InputStream input =
>     this.getClass().getClassLoader().getResourceAsStream(file);
> try {
>   new DomainSuffixesReader().read(this, input);
> } catch (Exception ex) {
>   LOG.warn(StringUtils.stringifyException(ex));
> }
> {code}
>
> instead of:
> {code:java}
> try (InputStream input =
>     this.getClass().getClassLoader().getResourceAsStream(file)) {
>   new DomainSuffixesReader().read(this, input);
> } catch (Exception ex) {
>   LOG.warn(StringUtils.stringifyException(ex));
> }
> {code}
> Where the InputStream is automatically closed.





[jira] [Resolved] (NUTCH-3027) Trivial resource leak patch in DomainSuffixes.java

2024-01-19 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-3027.
--
Fix Version/s: 1.20
   Resolution: Fixed

> Trivial resource leak patch in DomainSuffixes.java
> --
>
> Key: NUTCH-3027
> URL: https://issues.apache.org/jira/browse/NUTCH-3027
> Project: Nutch
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.20
>Reporter: Sascha Kehrli
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 1.20
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Found a trivial resource leak in .../util/DomainSuffixes.java, where an 
> InputStream is not closed:
> {code:java}
> InputStream input =
>     this.getClass().getClassLoader().getResourceAsStream(file);
> try {
>   new DomainSuffixesReader().read(this, input);
> } catch (Exception ex) {
>   LOG.warn(StringUtils.stringifyException(ex));
> }
> {code}
>
> instead of:
> {code:java}
> try (InputStream input =
>     this.getClass().getClassLoader().getResourceAsStream(file)) {
>   new DomainSuffixesReader().read(this, input);
> } catch (Exception ex) {
>   LOG.warn(StringUtils.stringifyException(ex));
> }
> {code}
> Where the InputStream is automatically closed.





[jira] [Assigned] (NUTCH-3027) Trivial resource leak patch in DomainSuffixes.java

2024-01-19 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma reassigned NUTCH-3027:


Assignee: Markus Jelsma

> Trivial resource leak patch in DomainSuffixes.java
> --
>
> Key: NUTCH-3027
> URL: https://issues.apache.org/jira/browse/NUTCH-3027
> Project: Nutch
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.20
>Reporter: Sascha Kehrli
>Assignee: Markus Jelsma
>Priority: Trivial
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Found a trivial resource leak in .../util/DomainSuffixes.java, where an 
> InputStream is not closed:
> {code:java}
> InputStream input =
>     this.getClass().getClassLoader().getResourceAsStream(file);
> try {
>   new DomainSuffixesReader().read(this, input);
> } catch (Exception ex) {
>   LOG.warn(StringUtils.stringifyException(ex));
> }
> {code}
>
> instead of:
> {code:java}
> try (InputStream input =
>     this.getClass().getClassLoader().getResourceAsStream(file)) {
>   new DomainSuffixesReader().read(this, input);
> } catch (Exception ex) {
>   LOG.warn(StringUtils.stringifyException(ex));
> }
> {code}
> Where the InputStream is automatically closed.





[jira] [Commented] (NUTCH-1635) New crawldb sometimes ends up in current

2023-10-02 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17771023#comment-17771023
 ] 

Markus Jelsma commented on NUTCH-1635:
--

Good point! No, we haven't seen this behaviour for the past decade or so. Let's 
close it!

Danke!

> New crawldb sometimes ends up in current
> 
>
> Key: NUTCH-1635
> URL: https://issues.apache.org/jira/browse/NUTCH-1635
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Markus Jelsma
>Priority: Major
>
> In some weird cases the newly created crawldb by updatedb ends up in 
> crawl/crawldb/current//. So instead of replacing current/, it ends up 
> inside current/! This causes the generator to fail.
> It's impossible to reliably reproduce the problem. It only happened a couple 
> of times in the last few years.





[jira] [Closed] (NUTCH-1635) New crawldb sometimes ends up in current

2023-10-02 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-1635.

Resolution: Not A Problem

> New crawldb sometimes ends up in current
> 
>
> Key: NUTCH-1635
> URL: https://issues.apache.org/jira/browse/NUTCH-1635
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Markus Jelsma
>Priority: Major
>
> In some weird cases the newly created crawldb by updatedb ends up in 
> crawl/crawldb/current//. So instead of replacing current/, it ends up 
> inside current/! This causes the generator to fail.
> It's impossible to reliably reproduce the problem. It only happened a couple 
> of times in the last few years.





[jira] [Commented] (NUTCH-3007) Fix impossible casts

2023-09-28 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769989#comment-17769989
 ] 

Markus Jelsma commented on NUTCH-3007:
--

+1 yes!

> Fix impossible casts
> 
>
> Key: NUTCH-3007
> URL: https://issues.apache.org/jira/browse/NUTCH-3007
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Spotbugs reports two occurrences of
>   Impossible cast from java.util.ArrayList to String[] in 
> org.apache.nutch.fetcher.Fetcher.run(Map, String)
> Both were introduced later into the {{run(Map args, String 
> crawlId)}} method and obviously never used (would throw a 
> ClassCastException). The code blocks should be removed.
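To show why Spotbugs flags this pattern, the sketch below reproduces the failure outside Nutch. It assumes the argument map holds Object values; the key "segments" and the class and method names are hypothetical. The issue proposes removing the affected blocks, so the conversion method is included only to show what a working variant would look like.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ImpossibleCastDemo {

  /** Returns true if casting the stored value directly to String[] fails. */
  static boolean directCastFails(Map<String, Object> args, String key) {
    try {
      // Compiles because get() returns Object, but fails at run time when the
      // stored value is an ArrayList.
      String[] values = (String[]) args.get(key);
      return false;
    } catch (ClassCastException e) {
      return true;
    }
  }

  /** Converts a List value to String[] explicitly instead of casting it. */
  static String[] toStringArray(Object value) {
    List<?> list = (List<?>) value;
    String[] result = new String[list.size()];
    for (int i = 0; i < result.length; i++) {
      result[i] = String.valueOf(list.get(i));
    }
    return result;
  }

  public static void main(String[] unused) {
    Map<String, Object> args = new HashMap<>();
    args.put("segments", new ArrayList<>(Arrays.asList("seg1", "seg2")));
    System.out.println("direct cast fails: " + directCastFails(args, "segments"));
    System.out.println(String.join(",", toStringArray(args.get("segments"))));
  }
}
```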





[jira] [Commented] (NUTCH-2852) Method invokes System.exit(...) 9 bugs

2023-09-28 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769988#comment-17769988
 ] 

Markus Jelsma commented on NUTCH-2852:
--

Seems just fine for these files +1

> Method invokes System.exit(...) 9 bugs
> --
>
> Key: NUTCH-2852
> URL: https://issues.apache.org/jira/browse/NUTCH-2852
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.18
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> org.apache.nutch.indexer.IndexingFiltersChecker since first historized release
> In class org.apache.nutch.indexer.IndexingFiltersChecker
> In method org.apache.nutch.indexer.IndexingFiltersChecker.run(String[])
> At IndexingFiltersChecker.java:[line 96]
> Another occurrence at IndexingFiltersChecker.java:[line 129]
> org.apache.nutch.indexer.IndexingFiltersChecker.run(String[]) invokes 
> System.exit(...), which shuts down the entire virtual machine
> Invoking System.exit shuts down the entire Java virtual machine. This should 
> only be done when it is appropriate. Such calls make it hard or impossible 
> for your code to be invoked by other code. Consider throwing a 
> RuntimeException instead.
> Also occurs in
>org.apache.nutch.net.URLFilterChecker since first historized release
>org.apache.nutch.net.URLNormalizerChecker since first historized release
>org.apache.nutch.parse.ParseSegment since first historized release
>org.apache.nutch.parse.ParserChecker since first historized release
>org.apache.nutch.service.NutchServer since first historized release
>org.apache.nutch.tools.CommonCrawlDataDumper since first historized release
>org.apache.nutch.tools.DmozParser since first historized release
>org.apache.nutch.util.AbstractChecker since first historized release 
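The Spotbugs advice quoted above can be sketched as follows. This is a generic illustration rather than a proposed Nutch change, and the class name CheckerExitDemo is hypothetical: run() reports problems by throwing, so other code can call it safely, and only main() translates the outcome into a process exit status.

```java
class CheckerExitDemo {

  /** Does the work and never terminates the JVM itself. */
  static int run(String[] args) {
    if (args.length < 1) {
      throw new IllegalArgumentException("Usage: CheckerExitDemo <url>");
    }
    System.out.println("Checking " + args[0]);
    return 0;
  }

  /** Command-line entry point: the only place where System.exit is called. */
  public static void main(String[] args) {
    int status;
    try {
      status = run(args);
    } catch (IllegalArgumentException e) {
      System.err.println(e.getMessage());
      status = 1;
    }
    System.exit(status);
  }
}
```

For command-line tools that are only ever started from main(), calling System.exit directly may be perfectly acceptable.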





[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2023-09-18 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766306#comment-17766306
 ] 

Markus Jelsma commented on NUTCH-2978:
--

Thanks for picking it up. I am very happy this one is resolved now. Thanks 
Sebastian for testing!

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, 
> NUTCH-2978-3.patch, NUTCH-2978-any23.patch, NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j2-slfj4-impl2 and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2023-09-13 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764699#comment-17764699
 ] 

Markus Jelsma commented on NUTCH-2978:
--

Did you manage to get it up and running, also when deployed on Hadoop? This 
ticket almost drove me to tears and despair :D

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, 
> NUTCH-2978-3.patch, NUTCH-2978-any23.patch, NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j2-slfj4-impl2 and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Commented] (NUTCH-3000) protocol-selenium returns only the body, strips off the <head> element

2023-09-13 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764697#comment-17764697
 ] 

Markus Jelsma commented on NUTCH-3000:
--

Yes, this is a bit odd indeed. +1

> protocol-selenium returns only the body, strips off the <head> element
> --
>
> Key: NUTCH-3000
> URL: https://issues.apache.org/jira/browse/NUTCH-3000
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Reporter: Tim Allison
>Priority: Major
>
> The selenium protocol returns only the body portion of the html, which means 
> that neither the title nor the other page metadata in the <head> section 
> gets extracted.
> {noformat}
> String innerHtml = driver.findElement(By.tagName("body"))
> .getAttribute("innerHTML");
> {noformat}
> We should return the full html, no?





[jira] [Commented] (NUTCH-2999) Update Lucene version to latest 8.x

2023-08-30 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760522#comment-17760522
 ] 

Markus Jelsma commented on NUTCH-2999:
--

Seems fine +1

> Update Lucene version to latest 8.x
> ---
>
> Key: NUTCH-2999
> URL: https://issues.apache.org/jira/browse/NUTCH-2999
> Project: Nutch
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>
> It may be the way that I'm loading the project, but, for me, Intellij really 
> does not like the Lucene version conflict between {{scoring-similarity}} and 
> the OpenSearch/Elasticsearch modules.
> Can we bump Lucene to the latest 8.11.2 throughout?
> PR for review incoming.





[jira] [Commented] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-28 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738191#comment-17738191
 ] 

Markus Jelsma commented on NUTCH-2993:
--

To be honest, I am not too happy with the implementation like this. Ideally we 
would regex all outlinks, but that will be even more costly. The crawler still 
ends up in bad sections of the site and further on the www, but with low depth 
settings, it is manageable.

> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15-1.patch, NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites.
> This patch overrides maxDepth for outlinks of URLs matching a configured 
> pattern. URLs not matching the pattern get the configured default max depth 
> value.
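The behaviour described above can be sketched as a small decision function. This is an illustration under stated assumptions, not the code from the attached patches: the class name, method name and parameters are hypothetical, and the pattern is applied with find(), so it may match anywhere in the URL.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; names and parameters are not taken from the patch.
class DepthOverrideDemo {

  /**
   * Returns the max depth to assign to the outlinks of the given URL: the
   * override when the URL matches the pattern, the default otherwise.
   */
  static int maxDepthForOutlinks(String url, Pattern pattern,
      int overrideDepth, int defaultDepth) {
    if (pattern == null) {
      // No pattern configured: behave as if the feature were disabled.
      return defaultDepth;
    }
    return pattern.matcher(url).find() ? overrideDepth : defaultDepth;
  }

  public static void main(String[] args) {
    Pattern newsSection = Pattern.compile("/news/");
    System.out.println(
        maxDepthForOutlinks("https://example.org/news/today", newsSection, 5, 1));
    System.out.println(
        maxDepthForOutlinks("https://example.org/about", newsSection, 5, 1));
  }
}
```

Falling back to the default whenever the URL does not match means a page that merely inherits a deeper limit from a matching parent does not pass it on to its own outlinks.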





[jira] [Commented] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-28 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738047#comment-17738047
 ] 

Markus Jelsma commented on NUTCH-2993:
--

Thanks Sebastian!
 # changed the checks again.
 # check for empty/non-configured pattern in place
 # props added to config
 # try/catch in place
 # typo removed

Ran a few long crawls with just over a hundred domains. I changed the checks 
again. Now the maxDepth resets if it does NOT match/find the pattern. There was 
still a possibility of sitemap-like pages being passed an overridden maxDepth, 
due to a linking page matching the pattern, and then a whole site got crawled 
anyway.

 

> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15-1.patch, NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites.
> This patch overrides maxDepth for outlinks of URLs matching a configured 
> pattern. URLs not matching the pattern get the configured default max depth 
> value.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-28 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Attachment: NUTCH-2993-1.15-1.patch

> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15-1.patch, NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites.
> This patch overrides maxDepth for outlinks of URLs matching a configured 
> pattern. URLs not matching the pattern get the configured default max depth 
> value.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-28 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Attachment: (was: NUTCH-2993.patch)

> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites.
> This patch overrides maxDepth for outlinks of URLs matching a configured 
> pattern. URLs not matching the pattern get the configured default max depth 
> value.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-28 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Attachment: NUTCH-2993.patch

> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites.
> This patch overrides maxDepth for outlinks of URLs matching a configured 
> pattern. URLs not matching the pattern get the configured default max depth 
> value.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-28 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Description: 
We do not want some crawl to go deep and broad, but instead focus it on a 
narrow section of sites.


This patch overrides maxDepth for outlinks of URLs matching a configured 
pattern. URLs not matching the pattern get the configured default max depth 
value.

  was:
We do not want some crawl to go deep and broad, but instead focus it on a 
narrow section of sites. This patch skips the depth check if the current URL 
matches some regular expression.

 

Initially we tried to set a custom maxDepth based on a Pattern match, but this 
didn't work. The crawler still managed to creep too deep due to having links 
everywhere.


> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites.
> This patch overrides maxDepth for outlinks of URLs matching a configured 
> pattern. URLs not matching the pattern get the configured default max depth 
> value.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Attachment: NUTCH-2993-1.15.patch

> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites. This patch skips the depth check if the current URL 
> matches some regular expression.
>  
> Initially we tried to set a custom maxDepth based on a Pattern match, but 
> this didn't work. The crawler still managed to creep too deep due to having 
> links everywhere.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Attachment: (was: NUTCH-2993-1.15-1.patch)

> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites. This patch skips the depth check if the current URL 
> matches some regular expression.
>  
> Initially we tried to set a custom maxDepth based on a Pattern match, but 
> this didn't work. The crawler still managed to creep too deep due to having 
> links everywhere.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Attachment: (was: NUTCH-2993-1.15.patch)

> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15-1.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites. This patch skips the depth check if the current URL 
> matches some regular expression.
>  
> Initially we tried to set a custom maxDepth based on a Pattern match, but 
> this didn't work. The crawler still managed to creep too deep due to having 
> links everywhere.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Attachment: NUTCH-2993-1.15-1.patch

> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15-1.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites. This patch skips the depth check if the current URL 
> matches some regular expression.
>  
> Initially we tried to set a custom maxDepth based on a Pattern match, but 
> this didn't work. The crawler still managed to creep too deep due to having 
> links everywhere.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Description: 
We do not want some crawl to go deep and broad, but instead focus it on a 
narrow section of sites. This patch skips the depth check if the current URL 
matches some regular expression.

 

Initially we tried to set a custom maxDepth based on a Pattern match, but this 
didn't work. The crawler still managed to creep too deep due to having links 
everywhere.

  was:We do not want some crawl to go deep and broad, but instead focus it on a
narrow section of sites. This patch allows for an overridden max depth if the
current URL matches against a Pattern. If find() succeeds, all outlinks are given
a new max depth.


> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites. This patch skips the depth check if the current URL 
> matches some regular expression.
>  
> Initially we tried to set a custom maxDepth based on a Pattern match, but 
> this didn't work. The crawler still managed to creep too deep due to having 
> links everywhere.
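
The behaviour in the updated description could be sketched roughly as follows. This is not the code from the attached patch: the class name `DepthCheck`, the method `shouldFollow` and the example pattern are hypothetical, chosen only to illustrate skipping the depth check when `Pattern.find()` succeeds on the current URL.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration; not the actual scoring-depth plugin code. */
public class DepthCheck {
  private final int maxDepth;
  private final Pattern skipPattern; // null when no exemption is configured

  public DepthCheck(int maxDepth, String skipRegex) {
    this.maxDepth = maxDepth;
    this.skipPattern = skipRegex == null ? null : Pattern.compile(skipRegex);
  }

  /** Returns true if outlinks of a page at the given depth may be followed. */
  public boolean shouldFollow(String url, int currentDepth) {
    // find() matches anywhere in the URL; matches() would require a full match
    if (skipPattern != null && skipPattern.matcher(url).find()) {
      return true; // exempt URL: the depth check is skipped entirely
    }
    return currentDepth < maxDepth;
  }

  public static void main(String[] args) {
    DepthCheck check = new DepthCheck(2, "/docs/");
    System.out.println(check.shouldFollow("https://example.org/docs/a/b/c", 5)); // true
    System.out.println(check.shouldFollow("https://example.org/blog/a/b/c", 5)); // false
    System.out.println(check.shouldFollow("https://example.org/blog/", 1));      // true
  }
}
```

Because the check is skipped rather than given a larger limit, pages under the matching section stay reachable at any depth, while links leading elsewhere are still cut off at the normal maximum.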





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Summary: ScoringDepth plugin to skip depth check based on URL Pattern  
(was: ScoringDepth plugin to override maxDepth based on URL Pattern)

> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites. This patch allows for an overridden max depth if the
> current URL matches against a Pattern. If find() succeeds, all outlinks are
> given a new max depth.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Attachment: NUTCH-2993-1.15.patch

> ScoringDepth plugin to override maxDepth based on URL Pattern
> -
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites. This patch allows for an overridden max depth if the
> current URL matches against a Pattern. If find() succeeds, all outlinks are
> given a new max depth.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Attachment: (was: NUTCH-2993-1.15.patch)

> ScoringDepth plugin to override maxDepth based on URL Pattern
> -
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites. This patch allows for an overridden max depth if the
> current URL matches against a Pattern. If find() succeeds, all outlinks are
> given a new max depth.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Attachment: (was: NUTCH-2993-1.15.patch)

> ScoringDepth plugin to override maxDepth based on URL Pattern
> -
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites. This patch allows for an overridden max depth if the
> current URL matches against a Pattern. If find() succeeds, all outlinks are
> given a new max depth.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Attachment: NUTCH-2993-1.15.patch

> ScoringDepth plugin to override maxDepth based on URL Pattern
> -
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites. This patch allows for an overridden max depth if the
> current URL matches against a Pattern. If find() succeeds, all outlinks are
> given a new max depth.





[jira] [Commented] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730535#comment-17730535
 ] 

Markus Jelsma commented on NUTCH-2993:
--

Here's a simple patch against Nutch 1.15. Will patch for master later.

> ScoringDepth plugin to override maxDepth based on URL Pattern
> -
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites. This patch allows for an overridden max depth if the
> current URL matches against a Pattern. If find() succeeds, all outlinks are
> given a new max depth.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Attachment: NUTCH-2993-1.15.patch

> ScoringDepth plugin to override maxDepth based on URL Pattern
> -
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites. This patch allows for an overridden max depth if the
> current URL matches against a Pattern. If find() succeeds, all outlinks are
> given a new max depth.





[jira] [Created] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-2993:


 Summary: ScoringDepth plugin to override maxDepth based on URL 
Pattern
 Key: NUTCH-2993
 URL: https://issues.apache.org/jira/browse/NUTCH-2993
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.20


We do not want some crawl to go deep and broad, but instead focus it on a 
narrow section of sites. This patch allows for an overridden max depth if the
current URL matches against a Pattern. If find() succeeds, all outlinks are given
a new max depth.





[jira] [Commented] (NUTCH-2985) Disable plugin urlfilter-validator by default

2023-02-24 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693291#comment-17693291
 ] 

Markus Jelsma commented on NUTCH-2985:
--

+1

> Disable plugin urlfilter-validator by default
> -
>
> Key: NUTCH-2985
> URL: https://issues.apache.org/jira/browse/NUTCH-2985
> Project: Nutch
>  Issue Type: Bug
>  Components: configuration, urlfilter
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> The plugin urlfilter-validator is activated by default (in nutch-default.xml) 
> but has two major issues which may confuse users of Nutch:
> - single-part domain names (localhost, etc.) are not allowed (NUTCH-2973)
> - IPv6 host names are rejected as invalid (NUTCH-2705)
> What about disabling it by default to overcome these issues?





[jira] [Commented] (NUTCH-2974) Ant build fails with "Unparseable date" on certain platforms

2023-01-16 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677435#comment-17677435
 ] 

Markus Jelsma commented on NUTCH-2974:
--

Sounds like a nice solution for this obscure bug +1

> Ant build fails with "Unparseable date" on certain platforms
> 
>
> Key: NUTCH-2974
> URL: https://issues.apache.org/jira/browse/NUTCH-2974
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> When touching the configuration templates the ant build fails on certain 
> platforms, see NUTCH-2512 and recently by [Kamil Mroczek on the users 
> list|https://lists.apache.org/thread/dc36ofc6kvvx3fxlqbnzqdcp73yjcj8m], 
> including a fix.
> However, we should also consider removing the "touch" action if it's not 
> clear what the purpose of it is - it's there since the initial import of the 
> Nutch source code to the Apache repository. Could be obsolete now.





[jira] [Commented] (NUTCH-2634) Some links marked as "nofollow" are followed anyway.

2023-01-06 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17655383#comment-17655383
 ] 

Markus Jelsma commented on NUTCH-2634:
--

+1

> Some links marked as "nofollow" are followed anyway.
> 
>
> Key: NUTCH-2634
> URL: https://issues.apache.org/jira/browse/NUTCH-2634
> Project: Nutch
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.20
>
>
> In order to check if an outlink in an <a> tag can be followed, Nutch checks
> whether the value of its rel attribute is the exact string "nofollow".
> However, [the rel attribute can contain a list of link 
> types|https://html.spec.whatwg.org/multipage/links.html#attr-hyperlink-rel], 
> all of which should be respected.
> So nutch rightfully doesn't follow a link like:
> {code:html}
> DO NOT FOLLOW THIS LINK
> {code}
> but wrongfully follows:
> {code:html}
> DO NOT FOLLOW THIS 
> LINK
> {code}
> Because of the code duplication in nutch's html parsers, this should be fixed 
> in two places:
> # 
> [parse/html/DOMContentUtils.java|https://github.com/apache/nutch/blob/3ada351a26b653b307c19e25b17e0e611a9bd59a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java#L437]
> # 
> [parse/tika/DOMContentUtils.java|https://github.com/apache/nutch/blob/f02110f42c53e77450835776cf41f22c23f030ec/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java#L410]
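
A token-based check along the lines the issue asks for could look like the sketch below. It is not the code in either `DOMContentUtils` class; `RelNoFollow` and `isNoFollow` are hypothetical names. It treats the `rel` value as a whitespace-separated list of link types and compares each one case-insensitively against "nofollow".

```java
import java.util.Arrays;

/** Hypothetical helper; not the code in either DOMContentUtils class. */
public class RelNoFollow {

  /** Returns true if any whitespace-separated token of rel equals "nofollow". */
  public static boolean isNoFollow(String rel) {
    if (rel == null) {
      return false; // no rel attribute at all
    }
    // Link types are separated by whitespace and compared ignoring case
    return Arrays.stream(rel.trim().split("\\s+"))
        .anyMatch("nofollow"::equalsIgnoreCase);
  }

  public static void main(String[] args) {
    System.out.println(isNoFollow("nofollow"));            // true
    System.out.println(isNoFollow("noopener NoFollow"));   // true
    System.out.println(isNoFollow("noopener noreferrer")); // false
    System.out.println(isNoFollow(null));                  // false
  }
}
```

Keeping such a helper in one shared place would also avoid having to apply the same fix twice, which is the duplication the issue points out.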





[jira] [Comment Edited] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-22 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17651243#comment-17651243
 ] 

Markus Jelsma edited comment on NUTCH-2978 at 12/22/22 11:33 AM:
-

Ah nope, this is not it. Parse-tika throws lots of errors and stack traces, 
although it does work. We now get:

java.util.ServiceConfigurationError: 
org.apache.logging.log4j.spi.Provider: 
org.apache.logging.log4j.core.impl.Log4jProvider not a subtype

There are no duplicate versions of the same logging JARs anywhere 
on the classpath.


was (Author: markus17):
Ah nope, this is not it. Parse-tika throws lots of errors and stack traces, 
although it does work. We now get:

java.util.ServiceConfigurationError: 
org.apache.logging.log4j.spi.Provider: 
org.apache.logging.log4j.core.impl.Log4jProvider not a subtype

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, 
> NUTCH-2978-3.patch, NUTCH-2978-any23.patch, NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-22 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17651243#comment-17651243
 ] 

Markus Jelsma commented on NUTCH-2978:
--

Ah nope, this is not it. Parse-tika throws lots of errors and stack traces, 
although it does work. We now get:

java.util.ServiceConfigurationError: 
org.apache.logging.log4j.spi.Provider: 
org.apache.logging.log4j.core.impl.Log4jProvider not a subtype
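
A "not a subtype" error from the service loader typically means the provider class and the service interface were loaded by different class loaders, even when only one version of each JAR exists. One way to check that is to print where each class comes from. The sketch below is a generic diagnostic that uses only JDK calls; `WhereLoaded` and `describe` are hypothetical names and nothing in it is specific to Nutch.

```java
import java.security.CodeSource;

/** Diagnostic sketch; the class names passed in are examples only. */
public class WhereLoaded {

  /** Describes which class loader and code location a class came from. */
  public static String describe(Class<?> cls) {
    ClassLoader loader = cls.getClassLoader(); // null means the bootstrap loader
    CodeSource source = cls.getProtectionDomain().getCodeSource(); // may be null
    return cls.getName()
        + " loader=" + (loader == null ? "bootstrap" : loader.toString())
        + " location=" + (source == null ? "unknown" : String.valueOf(source.getLocation()));
  }

  public static void main(String[] args) throws ClassNotFoundException {
    // Pass for example org.apache.logging.log4j.spi.Provider and
    // org.apache.logging.log4j.core.impl.Log4jProvider as arguments.
    for (String name : args) {
      System.out.println(describe(Class.forName(name)));
    }
    System.out.println(describe(WhereLoaded.class));
  }
}
```

`Class.forName` resolves through the caller's class loader, so to see what a plugin actually gets, the `describe` call would have to run from code inside that plugin.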

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, 
> NUTCH-2978-3.patch, NUTCH-2978-any23.patch, NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17648636#comment-17648636
 ] 

Markus Jelsma commented on NUTCH-2978:
--

The new patch makes sure that log4j 2.19 is present in parse-tika and listed in
its plugin.xml; otherwise the error above will occur. Now I am not sure the
other plugins are still OK.

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, 
> NUTCH-2978-3.patch, NUTCH-2978-any23.patch, NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2978:
-
Attachment: NUTCH-2978-3.patch

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, 
> NUTCH-2978-3.patch, NUTCH-2978-any23.patch, NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17648633#comment-17648633
 ] 

Markus Jelsma commented on NUTCH-2978:
--

OK, I also wanted to get rid of loose log4j libs. There was still one in any23
and one in parse-tika. When removing the lib from parse-tika, lots of bad things
happen.
{code:java}
22/12/16 13:36:03 WARN ooxml.OPCPackageDetector: Unable to load 
org.apache.tika.detect.microsoft.ooxml.OPCPackageDetector
java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
        at org.apache.poi.ooxml.POIXMLRelation.(POIXMLRelation.java:54)
        at 
org.apache.tika.detect.microsoft.ooxml.OPCPackageDetector.(OPCPackageDetector.java:106)
        at 
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)
        at 
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at 
java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at 
java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
        at java.base/java.lang.Class.newInstance(Class.java:584)
        at 
org.apache.tika.utils.ServiceLoaderUtils.newInstance(ServiceLoaderUtils.java:80)
        at 
org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:345)
        at 
org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:312)
        at 
org.apache.tika.detect.zip.DefaultZipContainerDetector.(DefaultZipContainerDetector.java:85)
        at 
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)
        at 
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at 
java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at 
java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
        at 
org.apache.tika.utils.ServiceLoaderUtils.newInstance(ServiceLoaderUtils.java:78)
        at 
org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:345)
        at 
org.apache.tika.detect.DefaultDetector.getDefaultDetectors(DefaultDetector.java:90)
        at 
org.apache.tika.detect.DefaultDetector.(DefaultDetector.java:50)
        at 
org.apache.tika.detect.DefaultDetector.(DefaultDetector.java:55)
        at 
org.apache.tika.config.TikaConfig.getDefaultDetector(TikaConfig.java:264)
        at 
org.apache.tika.config.TikaConfig$DetectorXmlLoader.createDefault(TikaConfig.java:1017)
        at 
org.apache.tika.config.TikaConfig$DetectorXmlLoader.createDefault(TikaConfig.java:975)
        at 
org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:630)
        at org.apache.tika.config.TikaConfig.(TikaConfig.java:155)
        at org.apache.tika.config.TikaConfig.(TikaConfig.java:145)
        at org.apache.tika.config.TikaConfig.(TikaConfig.java:120)
        at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:276)
        at 
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:177)
        at 
org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:136)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:75)
        at 
org.apache.nutch.indexer.IndexingFiltersChecker.process(IndexingFiltersChecker.java:245)
        at 
org.apache.nutch.util.AbstractChecker.processSingle(AbstractChecker.java:87)
        at 
org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:136)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81)
        at 
org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:316)
        at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
Caused by: java.lang.ClassNotFoundException: org.apache.logging.log4j.LogManager
        at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
        at 
java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
        at 
org.apache.nutch.plugin.PluginClassLoader.loadClassFromSystem(PluginClassLoader.java:105)
        at 
org.apache.nutch.plugin.PluginClassLoader.loadClassFromParent(PluginClassLoader.java:93)
       

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17648625#comment-17648625
 ] 

Markus Jelsma commented on NUTCH-2978:
--

Patch now includes Sebastian's patch, and actually contains the upgrade from 
old slf4j to the new 2.0.6. Tested on Hadoop 3.3.4 cluster with a parsing 
fetcher. This went just fine.

-I must admit that the slf4j and jcl-over-slf4j JARs remaining in the plugins do 
bother me to some degree.-

New patch now includes exclusions to get rid of all of them.

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, 
> NUTCH-2978-any23.patch, NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2978:
-
Attachment: NUTCH-2978-2.patch

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, 
> NUTCH-2978-any23.patch, NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2978:
-
Attachment: (was: NUTCH-2978-1.patch)

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-any23.patch, 
> NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2978:
-
Attachment: NUTCH-2978-1.patch

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-any23.patch, 
> NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2978:
-
Attachment: NUTCH-2978-1.patch

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-any23.patch, 
> NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-15 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17648060#comment-17648060
 ] 

Markus Jelsma commented on NUTCH-2978:
--

Ah yes, thanks! I am not sure if a 'solution' will come from Tika, that 
specific package seems to be shaded in all versions between 2.3.0 and 2.6.0. 
But, we, ASF Nutch, do not depend on it so we are good.

Patched like this, Nutch will fetch/parse just fine when running on Hadoop. I 
did get this when doing an indexchecker using the job file:

ERROR StatusLogger Log4j2 could not find a logging 
implementation. Please add log4j-core to the classpath. Using SimpleLogger to 
log to the console...

However, logging worked just fine.

 

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-any23.patch, NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-13 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646681#comment-17646681
 ] 

Markus Jelsma commented on NUTCH-2978:
--

About the slf4j issues:

Somewhere another slf4j jar was lurking in the job file, but i couldn't find it 
for a long while. Until i saw there was a slf4j jar packaged within the 
tika-parser-scientific-package! I got rid of it, then got a xerces/xml-apis 
error, which i then also excluded. Now there are many other errors.

Something to look out for when upgrading Tika. But for some reason, although we 
are using the same Tika version, that specific package does not appear as a 
dependency of Tika in vanilla Nutch. That may change later.
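
A stray copy of a logging class bundled inside another JAR can be located by asking the class loader for every location that provides the class file. The sketch below uses only JDK calls; `FindCopies` and `locations` are hypothetical names, and `org.slf4j.LoggerFactory` is just the default example.

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Diagnostic sketch for spotting several copies of one class on the classpath. */
public class FindCopies {

  /** Lists every classpath location that provides the given class. */
  public static List<URL> locations(String className) throws IOException {
    String resource = className.replace('.', '/') + ".class";
    ClassLoader loader = Thread.currentThread().getContextClassLoader();
    if (loader == null) {
      loader = ClassLoader.getSystemClassLoader();
    }
    // getResources returns all matches, not only the first one that wins
    return new ArrayList<>(Collections.list(loader.getResources(resource)));
  }

  public static void main(String[] args) throws IOException {
    String name = args.length > 0 ? args[0] : "org.slf4j.LoggerFactory";
    List<URL> found = locations(name);
    System.out.println(name + ": " + found.size() + " location(s)");
    for (URL url : found) {
      System.out.println("  " + url);
    }
  }
}
```

More than one location for the same class suggests that a bundled or shaded copy is present. The result only reflects the class loader the sketch runs under, so it would have to be run with the same classpath as the failing job to be meaningful.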

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-any23.patch, NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-2924) Generate maxCount expr evaluated only once

2022-12-12 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-2924.
--
Resolution: Fixed

To https://gitbox.apache.org/repos/asf/nutch.git
  d806aa450..7d3900450  master -> master

> Generate maxCount expr evaluated only once
> --
>
> Key: NUTCH-2924
> URL: https://issues.apache.org/jira/browse/NUTCH-2924
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.16
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2924-1.patch, NUTCH-2924-2.patch, 
> NUTCH-2924-3.patch, NUTCH-2924-4.patch, NUTCH-2924-5.patch, NUTCH-2924.patch
>
>
> The generate.maxCount expression is evaluated only once in the generator's 
> reducer; instead, it must be set once per host.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-08 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644838#comment-17644838
 ] 

Markus Jelsma edited comment on NUTCH-2978 at 12/8/22 2:34 PM:
---

Ah, well. I also tried a Tika parsing fetcher of a vanilla 1.20 Nutch with just 
this patch, and the generator patch. It works!

Not sure why our parser stuff fails, but at least Nutch's stuff is working! We 
both use a LoggerFactory.getLogger invocation, yet the original TikaParser 
invocation works and mine doesn't.


was (Author: markus17):
Ah, well. I also tried a Tika parsing fetcher of a vanilla 1.20 Nutch with just 
this patch, and the generator patch. It works!

Not sure why our parser stuff fails, but at least Nutch's stuff is working!

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-08 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644838#comment-17644838
 ] 

Markus Jelsma commented on NUTCH-2978:
--

Ah, well. I also tried a Tika parsing fetcher of a vanilla 1.20 Nutch with just 
this patch, and the generator patch. It works!

Not sure why our parser stuff fails, but at least Nutch's stuff is working!

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-08 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644825#comment-17644825
 ] 

Markus Jelsma edited comment on NUTCH-2978 at 12/8/22 2:12 PM:
---

This morning I saw one of our internal projects spewing the same error as 
any23; it was quickly remedied by upgrading a dependency further down the line. 
Not sure if this will go as easily with the any23 plugin, I'll take a look.

Regarding running on Hadoop, I just ran a patched 1.20 CrawldbReader job on a 
3.3.4 cluster, and it ran flawlessly! Encouraged by the result, I quickly ran a 
generate, followed by a fetch. The fetch failed due to a LinkageError in our 
parser plugin, similar to parse-tika. Too bad.

A local indexchecker runs fine, but an indexchecker using a job file fails with 
the same error.

Removing all reload4j references does not solve it, as expected. Not sure what 
to do now.


was (Author: markus17):
This morning I saw one of our internal projects spewing the same error as 
any23; it was quickly remedied by upgrading a dependency further down the line. 
Not sure if this will go as easily with the any23 plugin, I'll take a look.

Regarding running on Hadoop, I just ran a patched 1.20 CrawldbReader job on a 
3.3.4 cluster, and it ran flawlessly! Encouraged by the result, I quickly ran a 
generate, followed by a fetch. The fetch failed due to a LinkageError in our 
parser plugin, similar to parse-tika. Too bad.

A local indexchecker runs fine, but an indexchecker using a job file fails with 
the same error.

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-08 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644825#comment-17644825
 ] 

Markus Jelsma commented on NUTCH-2978:
--

This morning I saw one of our internal projects spewing the same error as 
any23; it was quickly remedied by upgrading a dependency further down the line. 
Not sure if this will go as easily with the any23 plugin, I'll take a look.

Regarding running on Hadoop, I just ran a patched 1.20 CrawldbReader job on a 
3.3.4 cluster, and it ran flawlessly! Encouraged by the result, I quickly ran a 
generate, followed by a fetch. The fetch failed due to a LinkageError in our 
parser plugin, similar to parse-tika. Too bad.

A local indexchecker runs fine, but an indexchecker using a job file fails with 
the same error.

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2924) Generate maxCount expr evaluated only once

2022-12-08 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644808#comment-17644808
 ] 

Markus Jelsma commented on NUTCH-2924:
--

Here's the proper patch, finally.

> Generate maxCount expr evaluated only once
> --
>
> Key: NUTCH-2924
> URL: https://issues.apache.org/jira/browse/NUTCH-2924
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.16
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2924-1.patch, NUTCH-2924-2.patch, 
> NUTCH-2924-3.patch, NUTCH-2924-4.patch, NUTCH-2924-5.patch, NUTCH-2924.patch
>
>
> The generate.maxCount expression is evaluated only once in the generator's 
> reducer; instead, it must be set once per host.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-2924) Generate maxCount expr evaluated only once

2022-12-08 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2924:
-
Attachment: NUTCH-2924-5.patch

> Generate maxCount expr evaluated only once
> --
>
> Key: NUTCH-2924
> URL: https://issues.apache.org/jira/browse/NUTCH-2924
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.16
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2924-1.patch, NUTCH-2924-2.patch, 
> NUTCH-2924-3.patch, NUTCH-2924-4.patch, NUTCH-2924-5.patch, NUTCH-2924.patch
>
>
> The generate.maxCount expression is evaluated only once in the generator's 
> reducer; instead, it must be set once per host.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-07 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644491#comment-17644491
 ] 

Markus Jelsma commented on NUTCH-2978:
--

Yes, I saw the slf4j present in the plugin; it already troubled me when I 
attempted an upgrade to a newer Tika version.

Regarding reload4j, I was already worried it might not run in distributed mode 
but haven't tested it yet. For now I am glad enough that Nutch runs our 
Tika-based parser in local mode.

To be continued.

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2924) Generate maxCount expr evaluated only once

2022-12-07 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644489#comment-17644489
 ] 

Markus Jelsma commented on NUTCH-2924:
--

Yes, that is expected. This patch requires a hostdb to be configured and 
present; I will add a check for that.
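
To illustrate the idea behind this ticket (evaluate the limit once per host, and handle hosts without hostdb data), here is a minimal sketch. It is not the patch itself: the class PerHostMaxCount, the Map standing in for the hostdb, and the IntUnaryOperator standing in for the configured expression are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntUnaryOperator;

/** Hypothetical illustration, not Nutch code. */
public class PerHostMaxCount {

  /**
   * Evaluates the expression once for every host instead of once per reducer.
   * hostDb maps a host to its fetched-page count (a stand-in for real hostdb data);
   * hosts missing from hostDb fall back to defaultMax.
   */
  public static Map<String, Integer> maxCounts(
      List<String> hosts, Map<String, Integer> hostDb, IntUnaryOperator expr, int defaultMax) {
    Map<String, Integer> result = new LinkedHashMap<>();
    for (String host : hosts) {
      Integer fetched = hostDb.get(host);
      // The check for missing hostdb data: no entry means no expression input.
      result.put(host, fetched == null ? defaultMax : expr.applyAsInt(fetched));
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, Integer> hostDb = Map.of("a.example", 10, "b.example", 1000);
    // Assumed example expression: larger hosts get a larger limit.
    IntUnaryOperator expr = fetched -> fetched > 100 ? 50 : 5;
    System.out.println(
        maxCounts(List.of("a.example", "b.example", "c.example"), hostDb, expr, 1));
  }
}
```

Evaluating inside the loop is the point of the sketch: a value computed once outside the loop would apply the first host's limit to every host.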

> Generate maxCount expr evaluated only once
> --
>
> Key: NUTCH-2924
> URL: https://issues.apache.org/jira/browse/NUTCH-2924
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.16
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2924-1.patch, NUTCH-2924-2.patch, 
> NUTCH-2924-3.patch, NUTCH-2924-4.patch, NUTCH-2924.patch
>
>
> The generate.maxCount expression is evaluated only once in the generator's 
> reducer; instead, it must be set once per host.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-2977) Support for showing dependency tree

2022-12-07 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-2977.
--
Fix Version/s: 1.20
   Resolution: Fixed

> Support for showing dependency tree
> ---
>
> Key: NUTCH-2977
> URL: https://issues.apache.org/jira/browse/NUTCH-2977
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2977.patch
>
>
> I am upgrading Nutch to slf4j 2 and need to get rid of old 1.7 stuff, and 
> especially reload4j. I desperately need this function for that.
>  
> $ ant dependencytree
>  
> will now show the tree.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2977) Support for showing dependency tree

2022-12-07 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644436#comment-17644436
 ] 

Markus Jelsma commented on NUTCH-2977:
--

Committed:
  ed7b6615b..d806aa450  master -> master

> Support for showing dependency tree
> ---
>
> Key: NUTCH-2977
> URL: https://issues.apache.org/jira/browse/NUTCH-2977
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2977.patch
>
>
> I am upgrading Nutch to slf4j 2 and need to get rid of old 1.7 stuff, and 
> especially reload4j. I desperately need this function for that.
>  
> $ ant dependencytree
>  
> will now show the tree.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-07 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2978:
-
Description: 
I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
log4j -> reload4j.

 

This patch fixes it.

  was:
I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
log4j -> reload4j.

 

 


> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-07 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2978:
-
Description: 
I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
log4j -> reload4j.

 

 

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-07 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2978:
-
Attachment: NUTCH-2978.patch

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978.patch
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-07 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-2978:


 Summary: Move to slf4j2 and remove log4j1 and reload4j
 Key: NUTCH-2978
 URL: https://issues.apache.org/jira/browse/NUTCH-2978
 Project: Nutch
  Issue Type: Task
Reporter: Markus Jelsma
 Attachments: NUTCH-2978.patch





--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-2977) Support for showing dependency tree

2022-12-07 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2977:
-
Attachment: NUTCH-2977.patch

> Support for showing dependency tree
> ---
>
> Key: NUTCH-2977
> URL: https://issues.apache.org/jira/browse/NUTCH-2977
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2977.patch
>
>
> I am upgrading Nutch to slf4j 2 and need to get rid of old 1.7 stuff, and 
> especially reload4j. I desperately need this function for that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-2977) Support for showing dependency tree

2022-12-07 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2977:
-
Description: 
I am upgrading Nutch to slf4j 2 and need to get rid of old 1.7 stuff, and 
especially reload4j. I desperately need this function for that.

 

$ ant dependencytree

 

will now show the tree.

  was:I am upgrading Nutch to slf4j 2 and need to get rid of old 1.7 stuff, and 
especially reload4j. I desperately need this function for that.


> Support for showing dependency tree
> ---
>
> Key: NUTCH-2977
> URL: https://issues.apache.org/jira/browse/NUTCH-2977
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2977.patch
>
>
> I am upgrading Nutch to slf4j 2 and need to get rid of old 1.7 stuff, and 
> especially reload4j. I desperately need this function for that.
>  
> $ ant dependencytree
>  
> will now show the tree.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-2977) Support for showing dependency tree

2022-12-07 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-2977:


 Summary: Support for showing dependency tree
 Key: NUTCH-2977
 URL: https://issues.apache.org/jira/browse/NUTCH-2977
 Project: Nutch
  Issue Type: Task
Reporter: Markus Jelsma
Assignee: Markus Jelsma


I am upgrading Nutch to slf4j 2 and need to get rid of old 1.7 stuff, and 
especially reload4j. I desperately need this function for that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2924) Generate maxCount expr evaluated only once

2022-12-07 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644298#comment-17644298
 ] 

Markus Jelsma commented on NUTCH-2924:
--

Updated patch for master.

> Generate maxCount expr evaluated only once
> --
>
> Key: NUTCH-2924
> URL: https://issues.apache.org/jira/browse/NUTCH-2924
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.16
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2924-1.patch, NUTCH-2924-2.patch, 
> NUTCH-2924-3.patch, NUTCH-2924-4.patch, NUTCH-2924.patch
>
>
> The generate.maxCount expression is evaluated only once in the generator's 
> reducer; instead, it must be set once per host.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-2924) Generate maxCount expr evaluated only once

2022-12-07 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2924:
-
Attachment: NUTCH-2924-4.patch

> Generate maxCount expr evaluated only once
> --
>
> Key: NUTCH-2924
> URL: https://issues.apache.org/jira/browse/NUTCH-2924
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.16
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2924-1.patch, NUTCH-2924-2.patch, 
> NUTCH-2924-3.patch, NUTCH-2924-4.patch, NUTCH-2924.patch
>
>
> The generate.maxCount expression is evaluated only once in the generator's 
> reducer; instead, it must be set once per host.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2973) Single domain names (eg https://localnet) can't be crawled - filtering fails

2022-10-20 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621023#comment-17621023
 ] 

Markus Jelsma commented on NUTCH-2973:
--

Hello David,

By default, urlfilter-validator is an active plugin, and it rejects hosts 
such as localhost or localnet. You can disable the plugin in your 
plugin.includes configuration directive.
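
Since plugin.includes is a regular expression over plugin IDs, the effect of a changed value can be checked before running a crawl. The sketch below is an illustration only: the class PluginIncludesCheck is hypothetical, the two expressions are shortened example values (the real default lives in nutch-default.xml), and it assumes a plugin is activated when its whole ID matches the expression.

```java
import java.util.regex.Pattern;

/** Hypothetical check, not part of Nutch. */
public class PluginIncludesCheck {

  /** True if pluginId matches the plugin.includes expression in full. */
  public static boolean isIncluded(String pluginIncludes, String pluginId) {
    return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
  }

  public static void main(String[] args) {
    // Assumed, shortened example values; compare with your own nutch-site.xml.
    String withValidator = "protocol-http|urlfilter-(regex|validator)|parse-(html|tika)";
    String withoutValidator = "protocol-http|urlfilter-regex|parse-(html|tika)";

    System.out.println(isIncluded(withValidator, "urlfilter-validator"));
    System.out.println(isIncluded(withoutValidator, "urlfilter-validator"));
    System.out.println(isIncluded(withoutValidator, "urlfilter-regex"));
  }
}
```

With the second value, urlfilter-validator no longer matches while urlfilter-regex still does, which is the change suggested above for crawling hosts such as localnet.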

 

 

> Single domain names (eg https://localnet) can't be crawled - filtering fails
> 
>
> Key: NUTCH-2973
> URL: https://issues.apache.org/jira/browse/NUTCH-2973
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.19
> Environment: Nutch 1.19, checked on Windows 10 and Ubuntu.  Both have 
> the same issue. 
> I'm trying to crawl a SharePoint intranet using Nutch where the URLs are 
> similar to:
>  
> {{https://localnet/something.aspx}}
> The issue is that Nutch is rejecting any url with a single element domain 
> name such as localnet above. "localnet.com" is not rejected, nor is 
> "local.localnet". It almost feels as if there's a chunk of code within Nutch 
> that's unrelated to the filtering mechanisms that rejects URLs outright if 
> they don't have a WWW style format and a WWW-style domain such as .COM
> Error message:
>  
> {{Total urls rejected by filters: 1}}
> I've checked and updated all the _filter_ files in the conf directory. Even 
> making them incredibly permissive (effectively "crawl everything") has not 
> helped.
>Reporter: David Smith
>Priority: Blocker
>
> There appears to be a bug within the core of Nutch that fails to permit any 
> single domain name URLs to be crawled.  Example:
> {{https://{*}localnet{*}/something.aspx}}
> The issue is that Nutch is rejecting any url with a single element domain 
> name such as *localnet* above. "localnet.com" is not rejected, nor is 
> "local.localnet". It almost feels as if there's a chunk of code within Nutch 
> that's unrelated to the filtering mechanisms that rejects URLs outright if 
> they don't have a WWW style format and a WWW-style domain such as .COM
> Error message:
> {{Total urls rejected by filters: 1}}
> I've checked and updated all the filter files in the conf directory. Even 
> making them incredibly permissive (effectively "crawl everything") has not 
> helped. As soon as a dot (.) is added to the domain name, the URL is no 
> longer rejected, e.g. blah.localnet.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2969) Javadoc: Javascript search is not working when built on JDK 11

2022-08-22 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582978#comment-17582978
 ] 

Markus Jelsma commented on NUTCH-2969:
--

Nice!

> Javadoc: Javascript search is not working when built on JDK 11
> --
>
> Key: NUTCH-2969
> URL: https://issues.apache.org/jira/browse/NUTCH-2969
> Project: Nutch
>  Issue Type: Bug
>  Components: build, documentation
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.19
>
>
> Newer Java versions create a nice Javascript-backed search box on the 
> Javadocs. When built on JDK 11 clicking on search results leads to 404 
> results with a {{/undefined/}} path element in the URL. This is caused by 
> [JDK-8215291|https://bugs.openjdk.org/browse/JDK-8215291] - Nutch does not 
> use Java modules.
> The issue can be solved by adding {{--no-module-directories}} as argument to 
> the {{javadoc}} command. However, this only works for JDK 11 and will break 
> the javadoc build for later JDK versions, see 
> [JDK-8215582|https://bugs.openjdk.org/browse/JDK-8215582]. 
> See also
> - 
> https://stackoverflow.com/questions/52326318/maven-javadoc-search-redirects-to-undefined-url
> - https://github.com/crawler-commons/crawler-commons/pull/380
> Notes:
> - seen while preparing a release candidate for 1.19 (will quickly commit the 
> solution)
> - adding {{--no-module-directories}} should be conditional for JDK 11 only
> -* ant: see [condition|https://ant.apache.org/manual/Tasks/condition.html]
> -* gradle:
> {noformat}
> if (JavaVersion.current().isJava11()) {
>   options.addBooleanOption("-no-module-directories", true)
> }
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2960) indexer-elastic: remove plugin from binary package to address licensing issues

2022-08-22 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582859#comment-17582859
 ] 

Markus Jelsma commented on NUTCH-2960:
--

Yes, this would be much preferred over removing the binaries from the 
distribution!

> indexer-elastic: remove plugin from binary package to address licensing issues
> --
>
> Key: NUTCH-2960
> URL: https://issues.apache.org/jira/browse/NUTCH-2960
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> The license of Elasticsearch has changed with v7.11.0 and upwards and is (if 
> correct) no longer compatible with the Apache license. Accordingly, we should 
> no longer ship Elastic jars with the binary package.
> It should be possible to keep the indexer-elastic plugin in the source 
> package as an [optional|https://www.apache.org/legal/resolved.html#optional] 
> dependency (indexer-solr is the default indexing backend and more are 
> available).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

