[jira] [Updated] (NUTCH-3056) Injector to support resolving seed URLs
[ https://issues.apache.org/jira/browse/NUTCH-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3056: - Description: We have a case where clients submit huge uncurated seed files: the host may no longer exist, or redirect via multiple hops to elsewhere, the protocol may be incorrect, etc. The large crawl itself is not supposed to venture much beyond the seed list, except for regex exceptions listed in db.ignore.external.exemptions. It is also not allowed to jump to other domains/hosts, to keep the size of the crawl under control. This means externally redirecting seeds will not be crawled. This ticket will add support for a multi-threaded host/domain/protocol/redirector/resolver to the injector. Seeds not leading to a 200 URL will be discarded. Enabling filtering and normalization is highly recommended for handling the redirects. If you have a seed file with 10k+ or millions of records, it is highly recommended to split the input file into chunks so that multiple mappers can get to work. Passing a few million records through one mapper without resolving is no problem, but resolving millions with one mapper, even if threaded, will take many hours. was: We have a case where clients submit huge uncurated seed files: the host may no longer exist, or redirect via multiple hops to elsewhere, the protocol may be incorrect, etc. The large crawl itself is not supposed to venture much beyond the seed list, except for regex exceptions listed in db.ignore.external.exemptions. It is also not allowed to jump to other domains/hosts, to keep the size of the crawl under control. This means externally redirecting seeds will not be crawled. This ticket will add support for a multi-threaded host/domain/protocol/redirector/resolver to the injector. If you have a seed file with 10k+ or millions of records, it is highly recommended to split the input file into chunks so that multiple mappers can get to work. 
Passing a few million records through one mapper without resolving is no problem, but resolving millions with one mapper, even if threaded, will take many hours. > Injector to support resolving seed URLs > --- > > Key: NUTCH-3056 > URL: https://issues.apache.org/jira/browse/NUTCH-3056 > Project: Nutch > Issue Type: Improvement > Reporter: Markus Jelsma > Assignee: Markus Jelsma > Priority: Minor > Fix For: 1.21 > > > We have a case where clients submit huge uncurated seed files: the host may > no longer exist, or redirect via multiple hops to elsewhere, the protocol may > be incorrect, etc. > The large crawl itself is not supposed to venture much beyond the seed list, > except for regex exceptions listed in db.ignore.external.exemptions. It is > also not allowed to jump to other domains/hosts, to keep the size of the > crawl under control. This means externally redirecting seeds will not be > crawled. > This ticket will add support for a multi-threaded > host/domain/protocol/redirector/resolver to the injector. Seeds not leading > to a 200 URL will be discarded. Enabling filtering and normalization is > highly recommended for handling the redirects. > If you have a seed file with 10k+ or millions of records, it is highly > recommended to split the input file into chunks so that multiple mappers can > get to work. Passing a few million records through one mapper without > resolving is no problem, but resolving millions with one mapper, even if > threaded, will take many hours. -- This message was sent by Atlassian Jira (v8.20.10#820010)
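The resolver idea described in this ticket can be pictured with a short sketch: resolve each seed in a thread pool, follow redirects, and keep only seeds whose final response is a 200. This is illustrative Python only — the actual injector addition is Java inside Nutch, and the function names and HEAD-request approach here are assumptions, not the patch:

```python
# Illustrative sketch of the NUTCH-3056 seed resolver idea (the real
# implementation is Java in the Nutch injector; these names are made up):
# resolve each seed in a thread pool, follow redirects, and keep only
# seeds whose final response status is 200.
from concurrent.futures import ThreadPoolExecutor
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

def resolve(url, timeout=10):
    """Follow redirects; return (final_url, status), or (url, None) on failure."""
    try:
        resp = urlopen(Request(url, method="HEAD"), timeout=timeout)
        return resp.geturl(), resp.status
    except HTTPError as e:
        return url, e.code
    except (URLError, ValueError):
        # dead host, bad protocol, malformed URL, ...
        return url, None

def keep_resolved(results):
    """Discard seeds whose final status is not 200."""
    return [final for final, status in results if status == 200]

def resolve_seeds(urls, threads=20):
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return keep_resolved(pool.map(resolve, urls))
```

As the ticket notes, even a threaded resolver is slow over millions of seeds, so splitting the seed file into chunks for multiple mappers remains the practical answer at scale.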
[jira] [Updated] (NUTCH-3056) Injector to support resolving seed URLs
[ https://issues.apache.org/jira/browse/NUTCH-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3056: - Description: We have a case where clients submit huge uncurated seed files: the host may no longer exist, or redirect via multiple hops to elsewhere, the protocol may be incorrect, etc. The large crawl itself is not supposed to venture much beyond the seed list, except for regex exceptions listed in db.ignore.external.exemptions. It is also not allowed to jump to other domains/hosts, to keep the size of the crawl under control. This means externally redirecting seeds will not be crawled. This ticket will add support for a multi-threaded host/domain/protocol/redirector/resolver to the injector. If you have a seed file with 10k+ or millions of records, it is highly recommended to split the input file into chunks so that multiple mappers can get to work. Passing a few million records through one mapper without resolving is no problem, but resolving millions with one mapper, even if threaded, will take many hours. was: We have a case where clients submit huge uncurated seed files: the host may no longer exist, or redirect via multiple hops to elsewhere, the protocol may be incorrect, etc. The large crawl itself is not supposed to venture much beyond the seed list, except for regex exceptions listed in db.ignore.external.exemptions. It is also not allowed to jump to other domains/hosts, to keep the size of the crawl under control. This means externally redirecting seeds will not be crawled. This ticket will add support for a multi-threaded host/domain/protocol/redirector/resolver to the injector. 
> Injector to support resolving seed URLs > --- > > Key: NUTCH-3056 > URL: https://issues.apache.org/jira/browse/NUTCH-3056 > Project: Nutch > Issue Type: Improvement > Reporter: Markus Jelsma > Assignee: Markus Jelsma > Priority: Minor > Fix For: 1.21 > > > We have a case where clients submit huge uncurated seed files: the host may > no longer exist, or redirect via multiple hops to elsewhere, the protocol may > be incorrect, etc. > The large crawl itself is not supposed to venture much beyond the seed list, > except for regex exceptions listed in db.ignore.external.exemptions. It is > also not allowed to jump to other domains/hosts, to keep the size of the > crawl under control. This means externally redirecting seeds will not be > crawled. > This ticket will add support for a multi-threaded > host/domain/protocol/redirector/resolver to the injector. > If you have a seed file with 10k+ or millions of records, it is highly > recommended to split the input file into chunks so that multiple mappers can > get to work. Passing a few million records through one mapper without > resolving is no problem, but resolving millions with one mapper, even if > threaded, will take many hours. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3056) Injector to support resolving seed URLs
Markus Jelsma created NUTCH-3056: Summary: Injector to support resolving seed URLs Key: NUTCH-3056 URL: https://issues.apache.org/jira/browse/NUTCH-3056 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.21 We have a case where clients submit huge uncurated seed files: the host may no longer exist, or redirect via multiple hops to elsewhere, the protocol may be incorrect, etc. The large crawl itself is not supposed to venture much beyond the seed list, except for regex exceptions listed in db.ignore.external.exemptions. It is also not allowed to jump to other domains/hosts, to keep the size of the crawl under control. This means externally redirecting seeds will not be crawled. This ticket will add support for a multi-threaded host/domain/protocol/redirector/resolver to the injector. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3028) WARCExporter to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842384#comment-17842384 ] Markus Jelsma commented on NUTCH-3028: -- Ok, the Content object is now also available in the evaluation. I added an example of it to the description above. > WARCExporter to support filtering by JEXL > - > > Key: NUTCH-3028 > URL: https://issues.apache.org/jira/browse/NUTCH-3028 > Project: Nutch > Issue Type: Improvement > Affects Versions: 1.19 > Reporter: Markus Jelsma > Assignee: Markus Jelsma > Priority: Minor > Fix For: 1.21 > > Attachments: NUTCH-3028-1.patch, NUTCH-3028-2.patch, NUTCH-3028.patch > > > Filtering segment data to WARC is now possible using JEXL expressions. In the > next example, all records with SOME_KEY=SOME_VALUE in their parseData > metadata are exported to WARC. > -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")' > or > -expr 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")' -- This message was sent by Atlassian Jira (v8.20.10#820010)
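Conceptually, the -expr filter evaluates a boolean expression against each record's metadata and exports only the matching records. A rough Python analogue of that behaviour (not the actual JEXL engine; the record layout below is invented for illustration):

```python
# Conceptual analogue of the WARC exporter's -expr JEXL filter: apply a
# predicate to each record's parse metadata and keep only the matches.
# The record structure here is invented for illustration only.
def matches(parse_meta, key, value):
    return parse_meta.get(key) == value

records = [
    {"url": "http://a/", "parse_meta": {"SOME_KEY": "SOME_VALUE"}},
    {"url": "http://b/", "parse_meta": {"SOME_KEY": "other"}},
]

# Equivalent in spirit to:
#   -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'
exported = [r["url"] for r in records
            if matches(r["parse_meta"], "SOME_KEY", "SOME_VALUE")]
```

In the real tool the expression string is compiled by JEXL and evaluated per record, so arbitrary boolean combinations over parseData (and, per the comment above, Content) are possible.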
[jira] [Updated] (NUTCH-3028) WARCExporter to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3028: - Attachment: NUTCH-3028-2.patch > WARCExporter to support filtering by JEXL > - > > Key: NUTCH-3028 > URL: https://issues.apache.org/jira/browse/NUTCH-3028 > Project: Nutch > Issue Type: Improvement > Affects Versions: 1.19 > Reporter: Markus Jelsma > Assignee: Markus Jelsma > Priority: Minor > Fix For: 1.21 > > Attachments: NUTCH-3028-1.patch, NUTCH-3028-2.patch, NUTCH-3028.patch > > > Filtering segment data to WARC is now possible using JEXL expressions. In the > next example, all records with SOME_KEY=SOME_VALUE in their parseData > metadata are exported to WARC. > -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")' > or > -expr 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")' -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3028) WARCExporter to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3028: - Description: Filtering segment data to WARC is now possible using JEXL expressions. In the next example, all records with SOME_KEY=SOME_VALUE in their parseData metadata are exported to WARC. -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")' or -expr 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")' was: Filtering segment data to WARC is now possible using JEXL expressions. In the next example, all records with SOME_KEY=SOME_VALUE in their parseData metadata are exported to WARC. -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")' > WARCExporter to support filtering by JEXL > - > > Key: NUTCH-3028 > URL: https://issues.apache.org/jira/browse/NUTCH-3028 > Project: Nutch > Issue Type: Improvement > Affects Versions: 1.19 > Reporter: Markus Jelsma > Assignee: Markus Jelsma > Priority: Minor > Fix For: 1.21 > > Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch > > > Filtering segment data to WARC is now possible using JEXL expressions. In the > next example, all records with SOME_KEY=SOME_VALUE in their parseData > metadata are exported to WARC. > -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")' > or > -expr 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")' -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3039) Failure to handle ftp:// URLs
[ https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836133#comment-17836133 ] Markus Jelsma commented on NUTCH-3039: -- Thanks for spotting that! > Failure to handle ftp:// URLs > - > > Key: NUTCH-3039 > URL: https://issues.apache.org/jira/browse/NUTCH-3039 > Project: Nutch > Issue Type: Bug > Components: plugin, protocol > Affects Versions: 1.19 > Reporter: Sebastian Nagel > Assignee: Sebastian Nagel > Priority: Major > Fix For: 1.21 > > > Nutch fails to handle ftp:// URLs: > - URLNormalizerBasic returns the empty string because creating the URL > instance fails with a MalformedURLException: > {noformat} > echo "ftp://ftp.example.com/path/file.txt" \ > | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat} > - fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due > to a MalformedURLException: > {noformat} > bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \ > "ftp://ftp.example.com/path/file.txt" > ... > Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: > java.net.MalformedURLException > at > org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113) > ...{noformat} > The issue is caused by NUTCH-2429: > - we do not provide a dedicated URL stream handler for ftp URLs > - but also do not pass ftp:// URLs to the standard JVM handler -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827048#comment-17827048 ] Markus Jelsma commented on NUTCH-3029: -- comment describing throws is also required these days. a8ec17ca8..98902236d master -> master > Host specific max. and min. intervals in adaptive scheduler > --- > > Key: NUTCH-3029 > URL: https://issues.apache.org/jira/browse/NUTCH-3029 > Project: Nutch > Issue Type: New Feature >Affects Versions: 1.19, 1.20 >Reporter: Martin Djukanovic >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 1.20 > > Attachments: adaptive-host-specific-intervals.txt.template, > new_adaptive_fetch_schedule-1.patch > > > This patch implements custom max. and min. refetching intervals for specific > hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt > configuration file (template also attached). -- This message was sent by Atlassian Jira (v8.20.10#820010)
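The feature this ticket adds can be pictured as a per-host lookup table that clamps whatever interval the adaptive scheduler computes. A hedged sketch only — the actual format of adaptive-host-specific-intervals.txt.template and the AdaptiveFetchSchedule internals may differ from the "host min max" layout assumed here:

```python
# Sketch of host-specific min/max refetch intervals (in seconds) clamping
# an adaptively computed interval. The "host min max" line format assumed
# here is hypothetical; check the attached template for the real one.
def parse_intervals(lines):
    """Parse 'host min max' lines into {host: (min, max)}."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        host, lo, hi = line.split()
        table[host] = (int(lo), int(hi))
    return table

def clamp_interval(table, host, interval, default=(3600, 2592000)):
    """Clamp the scheduler's interval into the host's [min, max] range."""
    lo, hi = table.get(host, default)
    return max(lo, min(hi, interval))
```

Hosts absent from the file simply fall back to the global min/max bounds, which matches the "custom intervals for specific hosts" intent of the patch.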
[jira] [Commented] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826823#comment-17826823 ] Markus Jelsma commented on NUTCH-3029: -- throws was missing too 84cda2abd..a8ec17ca8 master -> master > Host specific max. and min. intervals in adaptive scheduler > --- > > Key: NUTCH-3029 > URL: https://issues.apache.org/jira/browse/NUTCH-3029 > Project: Nutch > Issue Type: New Feature >Affects Versions: 1.19, 1.20 >Reporter: Martin Djukanovic >Assignee: Markus Jelsma >Priority: Minor > Attachments: adaptive-host-specific-intervals.txt.template, > new_adaptive_fetch_schedule-1.patch > > > This patch implements custom max. and min. refetching intervals for specific > hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt > configuration file (template also attached). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826783#comment-17826783 ] Markus Jelsma commented on NUTCH-3029: -- Thanks Lewis! 5ba50c0c6..84cda2abd master -> master > Host specific max. and min. intervals in adaptive scheduler > --- > > Key: NUTCH-3029 > URL: https://issues.apache.org/jira/browse/NUTCH-3029 > Project: Nutch > Issue Type: New Feature >Affects Versions: 1.19, 1.20 >Reporter: Martin Djukanovic >Assignee: Markus Jelsma >Priority: Minor > Attachments: adaptive-host-specific-intervals.txt.template, > new_adaptive_fetch_schedule-1.patch > > > This patch implements custom max. and min. refetching intervals for specific > hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt > configuration file (template also attached). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826759#comment-17826759 ] Markus Jelsma commented on NUTCH-3029: -- 4f62dec0f..5ba50c0c6 master -> master actual change was missing from the commit for some reason > Host specific max. and min. intervals in adaptive scheduler > --- > > Key: NUTCH-3029 > URL: https://issues.apache.org/jira/browse/NUTCH-3029 > Project: Nutch > Issue Type: New Feature >Affects Versions: 1.19, 1.20 >Reporter: Martin Djukanovic >Assignee: Markus Jelsma >Priority: Minor > Attachments: adaptive-host-specific-intervals.txt.template, > new_adaptive_fetch_schedule-1.patch > > > This patch implements custom max. and min. refetching intervals for specific > hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt > configuration file (template also attached). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3033) Upgrade Ivy to v2.5.2
[ https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826760#comment-17826760 ] Markus Jelsma commented on NUTCH-3033: -- Ah, the new ivy works like a charm! Thanks! > Upgrade Ivy to v2.5.2 > - > > Key: NUTCH-3033 > URL: https://issues.apache.org/jira/browse/NUTCH-3033 > Project: Nutch > Issue Type: Task > Components: ivy >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > Ivy v2.5.2 was released August 20th 2023. Let’s upgrade. > [https://ant.apache.org/ivy/history/2.5.2/release-notes.html] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-3029. -- Resolution: Fixed Thanks Martin! 551c50b1c..4642c30c2 master -> master > Host specific max. and min. intervals in adaptive scheduler > --- > > Key: NUTCH-3029 > URL: https://issues.apache.org/jira/browse/NUTCH-3029 > Project: Nutch > Issue Type: New Feature >Affects Versions: 1.19, 1.20 >Reporter: Martin Djukanovic >Assignee: Markus Jelsma >Priority: Minor > Attachments: adaptive-host-specific-intervals.txt.template, > new_adaptive_fetch_schedule-1.patch > > > This patch implements custom max. and min. refetching intervals for specific > hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt > configuration file (template also attached). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3030) Use system default cipher suites instead of hard-coded set
[ https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-3030. -- Resolution: Fixed 42b55f6a9..551c50b1c master -> master Thanks Martin! > Use system default cipher suites instead of hard-coded set > -- > > Key: NUTCH-3030 > URL: https://issues.apache.org/jira/browse/NUTCH-3030 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Martin Djukanovic >Assignee: Markus Jelsma >Priority: Minor > Attachments: NUTCH-3030.patch, default_ciphers_and_protocols-2.patch > > > If http.tls.supported.cipher.suites is not set in the configuration, it > defaults to a hard-coded list which is not exhaustive enough. I have > encountered websites that exclusively use ciphers which are not included, so > they could not be handled by protocol-http. > I changed this list to the system default -- SSLSocketFactory's > .getDefaultCipherSuites() to be precise. One could also use > .getSupportedCipherSuites() here, I suppose. > The original list should be moved to nutch-default.xml or omitted altogether. > The protocol list is still hard-coded, but it is now also added to > nutch-default.xml (so it can be easily changed manually if needed). -- This message was sent by Atlassian Jira (v8.20.10#820010)
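The Nutch change itself is Java (SSLSocketFactory's .getDefaultCipherSuites(), as the description says); the underlying principle — query the TLS library for its current defaults instead of shipping a hard-coded list that goes stale — looks like this in Python, shown purely as an illustration of the idea:

```python
import ssl

# Ask the TLS library for its current default cipher suites rather than
# hard-coding a list that goes stale. This mirrors in spirit what the
# patch does on the Java side with SSLSocketFactory.getDefaultCipherSuites().
ctx = ssl.create_default_context()
default_ciphers = [c["name"] for c in ctx.get_ciphers()]
```

The trade-off mentioned in the ticket applies here too: library defaults are a curated, reasonably secure set, while "all supported" suites (Java's .getSupportedCipherSuites()) would maximize compatibility at the cost of allowing weaker ciphers.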
[jira] [Updated] (NUTCH-3030) Use system default cipher suites instead of hard-coded set
[ https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3030: - Summary: Use system default cipher suites instead of hard-coded set (was: Update default TLS cipher suites for http(s) protocol) > Use system default cipher suites instead of hard-coded set > -- > > Key: NUTCH-3030 > URL: https://issues.apache.org/jira/browse/NUTCH-3030 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Martin Djukanovic >Assignee: Markus Jelsma >Priority: Minor > Attachments: NUTCH-3030.patch, default_ciphers_and_protocols-2.patch > > > If http.tls.supported.cipher.suites is not set in the configuration, it > defaults to a hard-coded list which is not exhaustive enough. I have > encountered websites that exclusively use ciphers which are not included, so > they could not be handled by protocol-http. > I changed this list to the system default -- SSLSocketFactory's > .getDefaultCipherSuites() to be precise. One could also use > .getSupportedCipherSuites() here, I suppose. > The original list should be moved to nutch-default.xml or omitted altogether. > The protocol list is still hard-coded, but it is now also added to > nutch-default.xml (so it can be easily changed manually if needed). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances
[ https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825863#comment-17825863 ] Markus Jelsma commented on NUTCH-3032: -- No idea what git fork is supposed to do; maybe it should be a git branch instead. I am not a skilled Git user, but you can always attach a patch to this ticket. > Indexing plugin as an adapter for end user's own POJO instances > --- > > Key: NUTCH-3032 > URL: https://issues.apache.org/jira/browse/NUTCH-3032 > Project: Nutch > Issue Type: Improvement > Components: indexer > Reporter: Joe Gilvary > Priority: Major > Labels: indexing > > It could be helpful to let end users manipulate information at indexing time > with their own code without the need for writing their own indexing plugin. I > mentioned this on the dev mailing list > (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some > description of my work in progress. > One potential use is to address some of the same concerns that NUTCH-585 > discusses regarding an alternative approach to picking and choosing which > content to index, but this approach would allow making index time decisions, > rather than setting the configuration for all content at the start of the > indexing run. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3031) ProtocolFactory host mapper to support domains
[ https://issues.apache.org/jira/browse/NUTCH-3031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-3031. -- Resolution: Fixed 83acd501e..c390dfc8b master -> master > ProtocolFactory host mapper to support domains > -- > > Key: NUTCH-3031 > URL: https://issues.apache.org/jira/browse/NUTCH-3031 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Major > Fix For: 1.20 > > Attachments: NUTCH-3031.patch > > > Currently ProtocolFactory supports different protocol plugins based on the > host configured for it. This patch will add support for listing domains as > well so you don't have to list numerous subdomains for one larger domain. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3031) ProtocolFactory host mapper to support domains
[ https://issues.apache.org/jira/browse/NUTCH-3031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3031: - Attachment: NUTCH-3031.patch > ProtocolFactory host mapper to support domains > -- > > Key: NUTCH-3031 > URL: https://issues.apache.org/jira/browse/NUTCH-3031 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Major > Fix For: 1.20 > > Attachments: NUTCH-3031.patch > > > Currently ProtocolFactory supports different protocol plugins based on the > host configured for it. This patch will add support for listing domains as > well so you don't have to list numerous subdomains for one larger domain. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3031) ProtocolFactory host mapper to support domains
Markus Jelsma created NUTCH-3031: Summary: ProtocolFactory host mapper to support domains Key: NUTCH-3031 URL: https://issues.apache.org/jira/browse/NUTCH-3031 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.20 Currently ProtocolFactory supports different protocol plugins based on the host configured for it. This patch will add support for listing domains as well so you don't have to list numerous subdomains for one larger domain. -- This message was sent by Atlassian Jira (v8.20.10#820010)
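The host-then-domain lookup that NUTCH-3031 adds can be sketched as a fallback over hostname suffixes, so one domain entry covers all of its subdomains. Illustrative Python only — the real mapping lives in Nutch's ProtocolFactory configuration, and the data structures and function name here are assumptions:

```python
# Sketch of the NUTCH-3031 idea: try an exact host match first, then fall
# back to progressively shorter hostname suffixes so a single domain entry
# covers every subdomain. The maps and function name are hypothetical.
from urllib.parse import urlparse

def plugin_for(url, host_map, domain_map):
    host = urlparse(url).hostname or ""
    if host in host_map:
        return host_map[host]  # exact host entry wins
    parts = host.split(".")
    # try "a.b.example.com", "b.example.com", "example.com" (skip bare TLD)
    for i in range(len(parts) - 1):
        suffix = ".".join(parts[i:])
        if suffix in domain_map:
            return domain_map[suffix]
    return None  # caller falls back to the default protocol plugin
```

With a single "example.com" domain entry, any subdomain resolves to the same plugin, which is exactly the "don't list numerous subdomains" convenience the ticket describes.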
[jira] [Commented] (NUTCH-3030) Update default TLS cipher suites for http(s) protocol
[ https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818531#comment-17818531 ] Markus Jelsma commented on NUTCH-3030: -- For some reason the attached patch did not apply cleanly (error on line 96), added new patch that does apply without complaining. > Update default TLS cipher suites for http(s) protocol > - > > Key: NUTCH-3030 > URL: https://issues.apache.org/jira/browse/NUTCH-3030 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Martin Djukanovic >Assignee: Markus Jelsma >Priority: Minor > Attachments: NUTCH-3030.patch, default_ciphers_and_protocols-2.patch > > > If http.tls.supported.cipher.suites is not set in the configuration, it > defaults to a hard-coded list which is not exhaustive enough. I have > encountered websites that exclusively use ciphers which are not included, so > they could not be handled by protocol-http. > I changed this list to the system default -- SSLSocketFactory's > .getDefaultCipherSuites() to be precise. One could also use > .getSupportedCipherSuites() here, I suppose. > The original list should be moved to nutch-default.xml or omitted altogether. > The protocol list is still hard-coded, but it is now also added to > nutch-default.xml (so it can be easily changed manually if needed). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3030) Update default TLS cipher suites for http(s) protocol
[ https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3030: - Attachment: NUTCH-3030.patch > Update default TLS cipher suites for http(s) protocol > - > > Key: NUTCH-3030 > URL: https://issues.apache.org/jira/browse/NUTCH-3030 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Martin Djukanovic >Assignee: Markus Jelsma >Priority: Minor > Attachments: NUTCH-3030.patch, default_ciphers_and_protocols-2.patch > > > If http.tls.supported.cipher.suites is not set in the configuration, it > defaults to a hard-coded list which is not exhaustive enough. I have > encountered websites that exclusively use ciphers which are not included, so > they could not be handled by protocol-http. > I changed this list to the system default -- SSLSocketFactory's > .getDefaultCipherSuites() to be precise. One could also use > .getSupportedCipherSuites() here, I suppose. > The original list should be moved to nutch-default.xml or omitted altogether. > The protocol list is still hard-coded, but it is now also added to > nutch-default.xml (so it can be easily changed manually if needed). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (NUTCH-3030) Update default TLS cipher suites for http(s) protocol
[ https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-3030: Assignee: Markus Jelsma > Update default TLS cipher suites for http(s) protocol > - > > Key: NUTCH-3030 > URL: https://issues.apache.org/jira/browse/NUTCH-3030 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.19 >Reporter: Martin Djukanovic >Assignee: Markus Jelsma >Priority: Minor > Attachments: default_ciphers_and_protocols-2.patch > > > If http.tls.supported.cipher.suites is not set in the configuration, it > defaults to a hard-coded list which is not exhaustive enough. I have > encountered websites that exclusively use ciphers which are not included, so > they could not be handled by protocol-http. > I changed this list to the system default -- SSLSocketFactory's > .getDefaultCipherSuites() to be precise. One could also use > .getSupportedCipherSuites() here, I suppose. > The original list should be moved to nutch-default.xml or omitted altogether. > The protocol list is still hard-coded, but it is now also added to > nutch-default.xml (so it can be easily changed manually if needed). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-3029: Assignee: Markus Jelsma > Host specific max. and min. intervals in adaptive scheduler > --- > > Key: NUTCH-3029 > URL: https://issues.apache.org/jira/browse/NUTCH-3029 > Project: Nutch > Issue Type: New Feature >Affects Versions: 1.19, 1.20 >Reporter: Martin Djukanovic >Assignee: Markus Jelsma >Priority: Minor > Attachments: adaptive-host-specific-intervals.txt.template, > new_adaptive_fetch_schedule-1.patch > > > This patch implements custom max. and min. refetching intervals for specific > hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt > configuration file (template also attached). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3028) WARCExporter to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815345#comment-17815345 ] Markus Jelsma commented on NUTCH-3028: -- New patch: when the expression was not set, an exception was raised. > WARCExporter to support filtering by JEXL > - > > Key: NUTCH-3028 > URL: https://issues.apache.org/jira/browse/NUTCH-3028 > Project: Nutch > Issue Type: Improvement > Reporter: Markus Jelsma > Assignee: Markus Jelsma > Priority: Minor > Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch > > > Filtering segment data to WARC is now possible using JEXL expressions. In the > next example, all records with SOME_KEY=SOME_VALUE in their parseData > metadata are exported to WARC. > -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")' -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3028) WARCExporter to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3028: - Attachment: NUTCH-3028-1.patch > WARCExporter to support filtering by JEXL > - > > Key: NUTCH-3028 > URL: https://issues.apache.org/jira/browse/NUTCH-3028 > Project: Nutch > Issue Type: Improvement > Reporter: Markus Jelsma > Assignee: Markus Jelsma > Priority: Minor > Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch > > > Filtering segment data to WARC is now possible using JEXL expressions. In the > next example, all records with SOME_KEY=SOME_VALUE in their parseData > metadata are exported to WARC. > -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")' -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17814731#comment-17814731 ] Markus Jelsma commented on NUTCH-3028: -- Any objections to this one before I get it in? > WARCExported to support filtering by JEXL > - > > Key: NUTCH-3028 > URL: https://issues.apache.org/jira/browse/NUTCH-3028 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Attachments: NUTCH-3028.patch > > > Filtering segment data to WARC is now possible using JEXL expressions. In the > next example, all records with SOME_KEY=SOME_VALUE in their parseData > metadata are exported to WARC. > {color:#00}-expr > 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'{color} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3028: - Description: Filtering segment data to WARC is now possible using JEXL expressions. In the next example, all records with SOME_KEY=SOME_VALUE in their parseData metadata are exported to WARC. {color:#00}-expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'{color} > WARCExported to support filtering by JEXL > - > > Key: NUTCH-3028 > URL: https://issues.apache.org/jira/browse/NUTCH-3028 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Attachments: NUTCH-3028.patch > > > Filtering segment data to WARC is now possible using JEXL expressions. In the > next example, all records with SOME_KEY=SOME_VALUE in their parseData > metadata are exported to WARC. > {color:#00}-expr > 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'{color} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3028: - Attachment: NUTCH-3027.patch > WARCExported to support filtering by JEXL > - > > Key: NUTCH-3028 > URL: https://issues.apache.org/jira/browse/NUTCH-3028 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Attachments: NUTCH-3028.patch > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3028: - Attachment: (was: NUTCH-3027.patch) > WARCExported to support filtering by JEXL > - > > Key: NUTCH-3028 > URL: https://issues.apache.org/jira/browse/NUTCH-3028 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Attachments: NUTCH-3028.patch > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3028: - Attachment: NUTCH-3028.patch > WARCExported to support filtering by JEXL > - > > Key: NUTCH-3028 > URL: https://issues.apache.org/jira/browse/NUTCH-3028 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Attachments: NUTCH-3028.patch > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3028) WARCExported to support filtering by JEXL
Markus Jelsma created NUTCH-3028: Summary: WARCExported to support filtering by JEXL Key: NUTCH-3028 URL: https://issues.apache.org/jira/browse/NUTCH-3028 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Assignee: Markus Jelsma -- This message was sent by Atlassian Jira (v8.20.10#820010)
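The `-expr` option described in the NUTCH-3028 thread passes a JEXL expression that is evaluated against each segment record; only records for which it returns true are written to the WARC output. As a conceptual illustration of that filtering step (plain Java standing in for the JEXL engine, with the parse metadata modeled as a simple map — the names here are illustrative, not Nutch API):

```java
import java.util.Map;
import java.util.function.Predicate;

public class WarcFilterSketch {
    // Stand-in for the JEXL expression
    // 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")':
    // keep a record only when its parse metadata carries the expected value.
    public static Predicate<Map<String, String>> keyEquals(String key, String value) {
        return meta -> value.equals(meta.get(key));
    }

    public static void main(String[] args) {
        Predicate<Map<String, String>> expr = keyEquals("SOME_KEY", "SOME_VALUE");
        System.out.println(expr.test(Map.of("SOME_KEY", "SOME_VALUE"))); // matching record: exported
        System.out.println(expr.test(Map.of("SOME_KEY", "OTHER")));      // non-matching record: skipped
    }
}
```

With real JEXL the expression string is compiled once and evaluated per record against a context holding `parseData`, which is what makes arbitrary user-supplied filters possible without recompiling the exporter.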
[jira] [Work started] (NUTCH-3027) Trivial resource leak patch in DomainSuffixes.java
[ https://issues.apache.org/jira/browse/NUTCH-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3027 started by Markus Jelsma. > Trivial resource leak patch in DomainSuffixes.java > -- > > Key: NUTCH-3027 > URL: https://issues.apache.org/jira/browse/NUTCH-3027 > Project: Nutch > Issue Type: Bug > Components: util >Affects Versions: 1.20 >Reporter: Sascha Kehrli >Assignee: Markus Jelsma >Priority: Trivial > Original Estimate: 1m > Remaining Estimate: 1m > > Found a trivial resource leak in .../util/DomainSuffixes.java, where an > InputStream is not closed: > {code:java} > InputStream input = > this.getClass().getClassLoader().getResourceAsStream(file); > try { > new DomainSuffixesReader().read(this, input); > } catch (Exception ex) { > LOG.warn(StringUtils.stringifyException(ex)); > } {code} > > instead of: > {code:java} > try (InputStream input = > this.getClass().getClassLoader().getResourceAsStream(file)) { > new DomainSuffixesReader().read(this, input); > } catch (Exception ex) { > LOG.warn(StringUtils.stringifyException(ex)); > } {code} > Where the InputStream is automatically closed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3027) Trivial resource leak patch in DomainSuffixes.java
[ https://issues.apache.org/jira/browse/NUTCH-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808614#comment-17808614 ] Markus Jelsma commented on NUTCH-3027: -- Thanks Sascha Kehrli! Committed {color:#00}85fea6e46..6b0455454 master -> master{color} > Trivial resource leak patch in DomainSuffixes.java > -- > > Key: NUTCH-3027 > URL: https://issues.apache.org/jira/browse/NUTCH-3027 > Project: Nutch > Issue Type: Bug > Components: util >Affects Versions: 1.20 >Reporter: Sascha Kehrli >Assignee: Markus Jelsma >Priority: Trivial > Original Estimate: 1m > Remaining Estimate: 1m > > Found a trivial resource leak in .../util/DomainSuffixes.java, where an > InputStream is not closed: > {code:java} > InputStream input = > this.getClass().getClassLoader().getResourceAsStream(file); > try { > new DomainSuffixesReader().read(this, input); > } catch (Exception ex) { > LOG.warn(StringUtils.stringifyException(ex)); > } {code} > > instead of: > {code:java} > try (InputStream input = > this.getClass().getClassLoader().getResourceAsStream(file)) { > new DomainSuffixesReader().read(this, input); > } catch (Exception ex) { > LOG.warn(StringUtils.stringifyException(ex)); > } {code} > Where the InputStream is automatically closed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3027) Trivial resource leak patch in DomainSuffixes.java
[ https://issues.apache.org/jira/browse/NUTCH-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-3027. -- Fix Version/s: 1.20 Resolution: Fixed > Trivial resource leak patch in DomainSuffixes.java > -- > > Key: NUTCH-3027 > URL: https://issues.apache.org/jira/browse/NUTCH-3027 > Project: Nutch > Issue Type: Bug > Components: util >Affects Versions: 1.20 >Reporter: Sascha Kehrli >Assignee: Markus Jelsma >Priority: Trivial > Fix For: 1.20 > > Original Estimate: 1m > Remaining Estimate: 1m > > Found a trivial resource leak in .../util/DomainSuffixes.java, where an > InputStream is not closed: > {code:java} > InputStream input = > this.getClass().getClassLoader().getResourceAsStream(file); > try { > new DomainSuffixesReader().read(this, input); > } catch (Exception ex) { > LOG.warn(StringUtils.stringifyException(ex)); > } {code} > > instead of: > {code:java} > try (InputStream input = > this.getClass().getClassLoader().getResourceAsStream(file)) { > new DomainSuffixesReader().read(this, input); > } catch (Exception ex) { > LOG.warn(StringUtils.stringifyException(ex)); > } {code} > Where the InputStream is automatically closed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (NUTCH-3027) Trivial resource leak patch in DomainSuffixes.java
[ https://issues.apache.org/jira/browse/NUTCH-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-3027: Assignee: Markus Jelsma > Trivial resource leak patch in DomainSuffixes.java > -- > > Key: NUTCH-3027 > URL: https://issues.apache.org/jira/browse/NUTCH-3027 > Project: Nutch > Issue Type: Bug > Components: util >Affects Versions: 1.20 >Reporter: Sascha Kehrli >Assignee: Markus Jelsma >Priority: Trivial > Original Estimate: 1m > Remaining Estimate: 1m > > Found a trivial resource leak in .../util/DomainSuffixes.java, where an > InputStream is not closed: > {code:java} > InputStream input = > this.getClass().getClassLoader().getResourceAsStream(file); > try { > new DomainSuffixesReader().read(this, input); > } catch (Exception ex) { > LOG.warn(StringUtils.stringifyException(ex)); > } {code} > > instead of: > {code:java} > try (InputStream input = > this.getClass().getClassLoader().getResourceAsStream(file)) { > new DomainSuffixesReader().read(this, input); > } catch (Exception ex) { > LOG.warn(StringUtils.stringifyException(ex)); > } {code} > Where the InputStream is automatically closed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
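The NUTCH-3027 fix above relies on try-with-resources: any `AutoCloseable` declared in the try header is closed automatically when the block exits, whether normally or via an exception. A minimal self-contained demonstration of that guarantee (hypothetical `TrackedStream` class, not Nutch code):

```java
public class TryWithResourcesDemo {
    // Hypothetical resource that records whether close() was invoked.
    static class TrackedStream implements AutoCloseable {
        boolean closed = false;
        @Override
        public void close() { closed = true; }
    }

    // Mirrors the patched pattern: the resource is closed on both the
    // normal path and the exceptional path, with no explicit close() call.
    public static boolean useAndLeaveClosed(boolean fail) {
        TrackedStream s = new TrackedStream();
        try (TrackedStream input = s) {
            if (fail) throw new RuntimeException("read error");
        } catch (Exception ex) {
            // matches the patch: log (here, swallow) and continue
        }
        return s.closed;
    }

    public static void main(String[] args) {
        System.out.println(useAndLeaveClosed(false)); // true
        System.out.println(useAndLeaveClosed(true));  // true: closed despite the exception
    }
}
```

The original code leaked because the stream was only reachable for closing inside the happy path; moving the declaration into the try header makes the runtime responsible for it.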
[jira] [Commented] (NUTCH-1635) New crawldb sometimes ends up in current
[ https://issues.apache.org/jira/browse/NUTCH-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17771023#comment-17771023 ] Markus Jelsma commented on NUTCH-1635: -- Good point! No, we haven't seen this behaviour for the past decade or so. Let's close it! Thanks! > New crawldb sometimes ends up in current > > > Key: NUTCH-1635 > URL: https://issues.apache.org/jira/browse/NUTCH-1635 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.7 >Reporter: Markus Jelsma >Priority: Major > > In some weird cases the newly created crawldb by updatedb ends up in > crawl/crawldb/current//. So instead of replacing current/, it ends up > inside current/! This causes the generator to fail. > It's impossible to reliably reproduce the problem. It only happened a couple > of times in the last few years. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (NUTCH-1635) New crawldb sometimes ends up in current
[ https://issues.apache.org/jira/browse/NUTCH-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-1635. Resolution: Not A Problem > New crawldb sometimes ends up in current > > > Key: NUTCH-1635 > URL: https://issues.apache.org/jira/browse/NUTCH-1635 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.7 >Reporter: Markus Jelsma >Priority: Major > > In some weird cases the newly created crawldb by updatedb ends up in > crawl/crawldb/current//. So instead of replacing current/, it ends up > inside current/! This causes the generator to fail. > It's impossible to reliably reproduce the problem. It only happened a couple > of times in the last few years. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3007) Fix impossible casts
[ https://issues.apache.org/jira/browse/NUTCH-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769989#comment-17769989 ] Markus Jelsma commented on NUTCH-3007: -- +1 yes! > Fix impossible casts > > > Key: NUTCH-3007 > URL: https://issues.apache.org/jira/browse/NUTCH-3007 > Project: Nutch > Issue Type: Sub-task >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > Spotbugs reports two occurrences of > Impossible cast from java.util.ArrayList to String[] in > org.apache.nutch.fetcher.Fetcher.run(Map, String) > Both were introduced later into the {{run(Map args, String > crawlId)}} method and obviously never used (would throw a > ClassCastException). The code blocks should be removed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2852) Method invokes System.exit(...) 9 bugs
[ https://issues.apache.org/jira/browse/NUTCH-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769988#comment-17769988 ] Markus Jelsma commented on NUTCH-2852: -- Seems just fine for these files +1 > Method invokes System.exit(...) 9 bugs > -- > > Key: NUTCH-2852 > URL: https://issues.apache.org/jira/browse/NUTCH-2852 > Project: Nutch > Issue Type: Sub-task >Affects Versions: 1.18 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > org.apache.nutch.indexer.IndexingFiltersChecker since first historized release > In class org.apache.nutch.indexer.IndexingFiltersChecker > In method org.apache.nutch.indexer.IndexingFiltersChecker.run(String[]) > At IndexingFiltersChecker.java:[line 96] > Another occurrence at IndexingFiltersChecker.java:[line 129] > org.apache.nutch.indexer.IndexingFiltersChecker.run(String[]) invokes > System.exit(...), which shuts down the entire virtual machine > Invoking System.exit shuts down the entire Java virtual machine. This should > only been done when it is appropriate. Such calls make it hard or impossible > for your code to be invoked by other code. Consider throwing a > RuntimeException instead. > Also occurs in >org.apache.nutch.net.URLFilterChecker since first historized release >org.apache.nutch.net.URLNormalizerChecker since first historized release >org.apache.nutch.parse.ParseSegment since first historized release >org.apache.nutch.parse.ParserChecker since first historized release >org.apache.nutch.service.NutchServer since first historized release >org.apache.nutch.tools.CommonCrawlDataDumper since first historized release >org.apache.nutch.tools.DmozParser since first historized release >org.apache.nutch.util.AbstractChecker since first historized release -- This message was sent by Atlassian Jira (v8.20.10#820010)
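The Spotbugs advice quoted in NUTCH-2852 is to throw instead of calling `System.exit(...)`, so embedding code can react to failures rather than having its JVM shut down. A hypothetical sketch of that refactor (illustrative names, not the actual Nutch checker classes):

```java
public class ExitVsThrowSketch {
    // Hypothetical refactor in the spirit of the report above: on bad input,
    // throw instead of System.exit(status), and return a status code the
    // caller can map to an exit code itself.
    public static int run(String[] args) {
        if (args.length == 0) {
            throw new IllegalArgumentException("usage: checker <url>");
        }
        return 0; // success status
    }

    public static void main(String[] args) {
        try {
            run(new String[0]);
        } catch (IllegalArgumentException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Only the outermost `main` (or a Hadoop `ToolRunner` driver) should translate the status or exception into an actual process exit, which is the design Spotbugs is nudging toward.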
[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766306#comment-17766306 ] Markus Jelsma commented on NUTCH-2978: -- Thanks for picking it up. I am very happy this one is resolved now. Thanks Sebastian for testing! > Move to slf4j2 and remove log4j1 and reload4j > - > > Key: NUTCH-2978 > URL: https://issues.apache.org/jira/browse/NUTCH-2978 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Priority: Major > Fix For: 1.20 > > Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, > NUTCH-2978-3.patch, NUTCH-2978-any23.patch, NUTCH-2978.patch > > > I got in trouble upgrading some dependencies and got a lot of LinkageErrors > today, or with a Tika upgrade, disappearing logs. This patch fixes that by > moving to slf4j2, using the correct log4j-slf4j2-impl binding and getting rid > of old log4j -> reload4j. > > This patch fixes it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764699#comment-17764699 ] Markus Jelsma commented on NUTCH-2978: -- You managed to get it up and running, as well as when deployed on Hadoop? This ticket almost drove me to tears and despair :D > Move to slf4j2 and remove log4j1 and reload4j > - > > Key: NUTCH-2978 > URL: https://issues.apache.org/jira/browse/NUTCH-2978 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Priority: Major > Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, > NUTCH-2978-3.patch, NUTCH-2978-any23.patch, NUTCH-2978.patch > > > I got in trouble upgrading some dependencies and got a lot of LinkageErrors > today, or with a Tika upgrade, disappearing logs. This patch fixes that by > moving to slf4j2, using the correct log4j-slf4j2-impl binding and getting rid > of old log4j -> reload4j. > > This patch fixes it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3000) protocol-selenium returns only the body, strips off the <head> element
[ https://issues.apache.org/jira/browse/NUTCH-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764697#comment-17764697 ] Markus Jelsma commented on NUTCH-3000: -- Yes, this is a bit odd indeed. +1 > protocol-selenium returns only the body, strips off the <head> element > -- > > Key: NUTCH-3000 > URL: https://issues.apache.org/jira/browse/NUTCH-3000 > Project: Nutch > Issue Type: Bug > Components: protocol >Reporter: Tim Allison >Priority: Major > > The selenium protocol returns only the body portion of the html, which means > that neither the title nor the other page metadata in the <head> section > gets extracted. > {noformat} > String innerHtml = driver.findElement(By.tagName("body")) > .getAttribute("innerHTML"); > {noformat} > We should return the full html, no? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2999) Update Lucene version to latest 8.x
[ https://issues.apache.org/jira/browse/NUTCH-2999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760522#comment-17760522 ] Markus Jelsma commented on NUTCH-2999: -- Seems fine +1 > Update Lucene version to latest 8.x > --- > > Key: NUTCH-2999 > URL: https://issues.apache.org/jira/browse/NUTCH-2999 > Project: Nutch > Issue Type: Task >Reporter: Tim Allison >Priority: Minor > > It may be the way that I'm loading the project, but, for me, Intellij really > does not like the Lucene version conflict between {{scoring-similarity}} and > the OpenSearch/Elasticsearch modules. > Can we bump Lucene to the latest 8.11.2 throughout? > PR for review incoming. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738191#comment-17738191 ] Markus Jelsma commented on NUTCH-2993: -- To be honest, I am not too happy with the implementation like this. Ideally we would regex all outlinks, but that will be even more costly. The crawler still ends up in bad sections of the site and further on the www, but with low depth settings, it is manageable. > ScoringDepth plugin to skip depth check based on URL Pattern > > > Key: NUTCH-2993 > URL: https://issues.apache.org/jira/browse/NUTCH-2993 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.20 > > Attachments: NUTCH-2993-1.15-1.patch, NUTCH-2993-1.15.patch > > > We do not want some crawl to go deep and broad, but instead focus it on a > narrow section of sites. > This patch overrides maxDepth for outlinks of URLs matching a configured > pattern. URLs not matching the pattern get the default max depth value > configured. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738047#comment-17738047 ] Markus Jelsma commented on NUTCH-2993: -- Thanks Sebastian! # changed the checks again. # check for empty/non-configured pattern in place # props added to config # try/catch in place # typo removed Ran a few long crawls with just over a hundred domains. I changed the checks again. Now the maxDepth resets if it does NOT match/find the pattern. There was still a possibility of sitemap-like pages being passed an overridden maxDepth, due to a linking page matching the pattern, and then a whole site got crawled anyway. > ScoringDepth plugin to skip depth check based on URL Pattern > > > Key: NUTCH-2993 > URL: https://issues.apache.org/jira/browse/NUTCH-2993 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.20 > > Attachments: NUTCH-2993-1.15-1.patch, NUTCH-2993-1.15.patch > > > We do not want some crawl to go deep and broad, but instead focus it on a > narrow section of sites. > This patch overrides maxDepth for outlinks of URLs matching a configured > pattern. URL not matching the pattern get the default max depth value > configured. -- This message was sent by Atlassian Jira (v8.20.10#820010)
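The depth-override rule discussed in the NUTCH-2993 comments can be sketched as follows (hypothetical method and parameter names; the actual plugin operates on CrawlDatum metadata during scoring): a configured `Pattern` is matched against the linking URL, and its outlinks get the overridden max depth only while the page matches, so a single matching link cannot open up a whole site.

```java
import java.util.regex.Pattern;

public class DepthOverrideSketch {
    // Hypothetical sketch of the behaviour described above: outlinks of a
    // URL matching the configured pattern get the overridden max depth;
    // a URL that does NOT match resets to the default, so the override
    // does not propagate past the matching section of the site.
    public static int effectiveMaxDepth(String url, Pattern pattern,
                                        int defaultMaxDepth, int overriddenMaxDepth) {
        return pattern.matcher(url).find() ? overriddenMaxDepth : defaultMaxDepth;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("/docs/");
        System.out.println(effectiveMaxDepth("https://example.org/docs/page.html", p, 2, 5)); // 5
        System.out.println(effectiveMaxDepth("https://example.org/blog/post.html", p, 2, 5)); // 2
    }
}
```

Note `find()` rather than `matches()`: the pattern only needs to occur somewhere in the URL, which matches how the issue description talks about pattern matching.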
[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: NUTCH-2993-1.15-1.patch > ScoringDepth plugin to skip depth check based on URL Pattern > > > Key: NUTCH-2993 > URL: https://issues.apache.org/jira/browse/NUTCH-2993 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.20 > > Attachments: NUTCH-2993-1.15-1.patch, NUTCH-2993-1.15.patch > > > We do not want some crawl to go deep and broad, but instead focus it on a > narrow section of sites. > This patch overrides maxDepth for outlinks of URLs matching a configured > pattern. URL not matching the pattern get the default max depth value > configured. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: (was: NUTCH-2993.patch) > ScoringDepth plugin to skip depth check based on URL Pattern > > > Key: NUTCH-2993 > URL: https://issues.apache.org/jira/browse/NUTCH-2993 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.20 > > Attachments: NUTCH-2993-1.15.patch > > > We do not want some crawl to go deep and broad, but instead focus it on a > narrow section of sites. > This patch overrides maxDepth for outlinks of URLs matching a configured > pattern. URL not matching the pattern get the default max depth value > configured. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: NUTCH-2993.patch > ScoringDepth plugin to skip depth check based on URL Pattern > > > Key: NUTCH-2993 > URL: https://issues.apache.org/jira/browse/NUTCH-2993 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.20 > > Attachments: NUTCH-2993-1.15.patch > > > We do not want some crawl to go deep and broad, but instead focus it on a > narrow section of sites. > This patch overrides maxDepth for outlinks of URLs matching a configured > pattern. URL not matching the pattern get the default max depth value > configured. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Description: We do not want some crawl to go deep and broad, but instead focus it on a narrow section of sites. This patch overrides maxDepth for outlinks of URLs matching a configured pattern. URLs not matching the pattern get the default max depth value configured. was: We do not want some crawl to go deep and broad, but instead focus it on a narrow section of sites. This patch skips the depth check if the current URL matches some regular expression. Initially we tried to set a custom maxDepth based on a Pattern match, but this didn't work. The crawler still managed to creep too deep due to having links everywhere. > ScoringDepth plugin to skip depth check based on URL Pattern > > > Key: NUTCH-2993 > URL: https://issues.apache.org/jira/browse/NUTCH-2993 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.20 > > Attachments: NUTCH-2993-1.15.patch > > > We do not want some crawl to go deep and broad, but instead focus it on a > narrow section of sites. > This patch overrides maxDepth for outlinks of URLs matching a configured > pattern. URLs not matching the pattern get the default max depth value > configured. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: NUTCH-2993-1.15.patch > ScoringDepth plugin to skip depth check based on URL Pattern > > > Key: NUTCH-2993 > URL: https://issues.apache.org/jira/browse/NUTCH-2993 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.20 > > Attachments: NUTCH-2993-1.15.patch > > > We do not want some crawl to go deep and broad, but instead focus it on a > narrow section of sites. This patch skips the depth check if the current URL > matches some regular expression. > > Initially we tried to set a custom maxDepth based on a Pattern match, but > this didn't work. The crawler still managed to creep too deep due to having > links everywhere. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: (was: NUTCH-2993-1.15-1.patch) > ScoringDepth plugin to skip depth check based on URL Pattern > > > Key: NUTCH-2993 > URL: https://issues.apache.org/jira/browse/NUTCH-2993 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.20 > > Attachments: NUTCH-2993-1.15.patch > > > We do not want some crawl to go deep and broad, but instead focus it on a > narrow section of sites. This patch skips the depth check if the current URL > matches some regular expression. > > Initially we tried to set a custom maxDepth based on a Pattern match, but > this didn't work. The crawler still managed to creep too deep due to having > links everywhere. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: (was: NUTCH-2993-1.15.patch) > ScoringDepth plugin to skip depth check based on URL Pattern > > > Key: NUTCH-2993 > URL: https://issues.apache.org/jira/browse/NUTCH-2993 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.20 > > Attachments: NUTCH-2993-1.15-1.patch > > > We do not want some crawl to go deep and broad, but instead focus it on a > narrow section of sites. This patch skips the depth check if the current URL > matches some regular expression. > > Initially we tried to set a custom maxDepth based on a Pattern match, but > this didn't work. The crawler still managed to creep too deep due to having > links everywhere. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: NUTCH-2993-1.15-1.patch > ScoringDepth plugin to skip depth check based on URL Pattern > > > Key: NUTCH-2993 > URL: https://issues.apache.org/jira/browse/NUTCH-2993 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.20 > > Attachments: NUTCH-2993-1.15-1.patch > > > We do not want some crawl to go deep and broad, but instead focus it on a > narrow section of sites. This patch skips the depth check if the current URL > matches some regular expression. > > Initially we tried to set a custom maxDepth based on a Pattern match, but > this didn't work. The crawler still managed to creep too deep due to having > links everywhere. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Description: We do not want some crawl to go deep and broad, but instead focus it on a narrow section of sites. This patch skips the depth check if the current URL matches some regular expression. Initially we tried to set a custom maxDepth based on a Pattern match, but this didn't work. The crawler still managed to creep too deep due to having links everywhere. was:We do not want some crawl to go deep and broad, but instead focus it on a narrow section of sites. This patch allows for a overridden max depth if the current URL matches against a Pattern. If find(), all outlinks are given a new max depth. > ScoringDepth plugin to skip depth check based on URL Pattern > > > Key: NUTCH-2993 > URL: https://issues.apache.org/jira/browse/NUTCH-2993 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.20 > > Attachments: NUTCH-2993-1.15.patch > > > We do not want some crawl to go deep and broad, but instead focus it on a > narrow section of sites. This patch skips the depth check if the current URL > matches some regular expression. > > Initially we tried to set a custom maxDepth based on a Pattern match, but > this didn't work. The crawler still managed to creep too deep due to having > links everywhere. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Summary: ScoringDepth plugin to skip depth check based on URL Pattern (was: ScoringDepth plugin to override maxDepth based on URL Pattern) > ScoringDepth plugin to skip depth check based on URL Pattern > > > Key: NUTCH-2993 > URL: https://issues.apache.org/jira/browse/NUTCH-2993 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.20 > > Attachments: NUTCH-2993-1.15.patch > > > We do not want some crawl to go deep and broad, but instead focus it on a > narrow section of sites. This patch allows for a overridden max depth if the > current URL matches against a Pattern. If find(), all outlinks are given a > new max depth. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: NUTCH-2993-1.15.patch > ScoringDepth plugin to override maxDepth based on URL Pattern > - > > Key: NUTCH-2993 > URL: https://issues.apache.org/jira/browse/NUTCH-2993 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.20 > > Attachments: NUTCH-2993-1.15.patch > > > We do not want some crawl to go deep and broad, but instead focus it on a > narrow section of sites. This patch allows for a overridden max depth if the > current URL matches against a Pattern. If find(), all outlinks are given a > new max depth. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: (was: NUTCH-2993-1.15.patch) > ScoringDepth plugin to override maxDepth based on URL Pattern > - > > Key: NUTCH-2993 > URL: https://issues.apache.org/jira/browse/NUTCH-2993 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.20 > > Attachments: NUTCH-2993-1.15.patch > > > We do not want some crawl to go deep and broad, but instead focus it on a > narrow section of sites. This patch allows for a overridden max depth if the > current URL matches against a Pattern. If find(), all outlinks are given a > new max depth. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: (was: NUTCH-2993-1.15.patch) > ScoringDepth plugin to override maxDepth based on URL Pattern > - > > Key: NUTCH-2993 > URL: https://issues.apache.org/jira/browse/NUTCH-2993 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.20 > > Attachments: NUTCH-2993-1.15.patch > > > We do not want some crawl to go deep and broad, but instead focus it on a > narrow section of sites. This patch allows for a overridden max depth if the > current URL matches against a Pattern. If find(), all outlinks are given a > new max depth. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: NUTCH-2993-1.15.patch > ScoringDepth plugin to override maxDepth based on URL Pattern > - > > Key: NUTCH-2993 > URL: https://issues.apache.org/jira/browse/NUTCH-2993 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.20 > > Attachments: NUTCH-2993-1.15.patch > > > We do not want some crawl to go deep and broad, but instead focus it on a > narrow section of sites. This patch allows for a overridden max depth if the > current URL matches against a Pattern. If find(), all outlinks are given a > new max depth. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730535#comment-17730535 ] Markus Jelsma commented on NUTCH-2993: -- Here's a simple patch against Nutch 1.15. Will patch for master later. > ScoringDepth plugin to override maxDepth based on URL Pattern > - > > Key: NUTCH-2993 > URL: https://issues.apache.org/jira/browse/NUTCH-2993 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.20 > > Attachments: NUTCH-2993-1.15.patch > > > We do not want some crawl to go deep and broad, but instead focus it on a > narrow section of sites. This patch allows for a overridden max depth if the > current URL matches against a Pattern. If find(), all outlinks are given a > new max depth. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: NUTCH-2993-1.15.patch > ScoringDepth plugin to override maxDepth based on URL Pattern > - > > Key: NUTCH-2993 > URL: https://issues.apache.org/jira/browse/NUTCH-2993 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.20 > > Attachments: NUTCH-2993-1.15.patch > > > We do not want some crawl to go deep and broad, but instead focus it on a > narrow section of sites. This patch allows for a overridden max depth if the > current URL matches against a Pattern. If find(), all outlinks are given a > new max depth. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern
Markus Jelsma created NUTCH-2993: Summary: ScoringDepth plugin to override maxDepth based on URL Pattern Key: NUTCH-2993 URL: https://issues.apache.org/jira/browse/NUTCH-2993 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.20 We do not want some crawl to go deep and broad, but instead focus it on a narrow section of sites. This patch allows for an overridden max depth if the current URL matches against a Pattern. If find() matches, all outlinks are given a new max depth. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2985) Disable plugin urlfilter-validator by default
[ https://issues.apache.org/jira/browse/NUTCH-2985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693291#comment-17693291 ] Markus Jelsma commented on NUTCH-2985: -- +1 > Disable plugin urlfilter-validator by default > - > > Key: NUTCH-2985 > URL: https://issues.apache.org/jira/browse/NUTCH-2985 > Project: Nutch > Issue Type: Bug > Components: configuration, urlfilter >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > The plugin urlfilter-validator is activated by default (in nutch-default.xml) > but has two major issues which may confuse users of Nutch: > - single-part domain names (localhost, etc.) are not allowed (NUTCH-2973) > - IPv6 host names are rejected as invalid (NUTCH-2705) > What about disabling it by default to overcome these issues? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2974) Ant build fails with "Unparseable date" on certain platforms
[ https://issues.apache.org/jira/browse/NUTCH-2974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677435#comment-17677435 ] Markus Jelsma commented on NUTCH-2974: -- Sounds like a nice solution for this obscure bug +1 > Ant build fails with "Unparseable date" on certain platforms > > > Key: NUTCH-2974 > URL: https://issues.apache.org/jira/browse/NUTCH-2974 > Project: Nutch > Issue Type: Bug > Components: build >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > When touching the configuration templates the ant build fails on certain > platforms, see NUTCH-2512 and recently by [Kamil Mroczek on the users > list|https://lists.apache.org/thread/dc36ofc6kvvx3fxlqbnzqdcp73yjcj8m], > including a fix. > However, we should also consider removing the "touch" action if it's not > clear what the purpose of it is - it's there since the initial import of the > Nutch source code to the Apache repository. Could be obsolete now. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2634) Some links marked as "nofollow" are followed anyway.
[ https://issues.apache.org/jira/browse/NUTCH-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17655383#comment-17655383 ] Markus Jelsma commented on NUTCH-2634: -- +1 > Some links marked as "nofollow" are followed anyway. > > > Key: NUTCH-2634 > URL: https://issues.apache.org/jira/browse/NUTCH-2634 > Project: Nutch > Issue Type: Bug >Reporter: Gerard Bouchar >Priority: Major > Fix For: 1.20 > > > In order to check if an outlink in an <a> tag can be followed, nutch checks > whether the value of its rel attribute is the exact string "nofollow". > However, [the rel attribute can contain a list of link > types|https://html.spec.whatwg.org/multipage/links.html#attr-hyperlink-rel], > all of which should be respected. > So nutch rightfully doesn't follow a link like: > {code:html} > <a href="..." rel="nofollow">DO NOT FOLLOW THIS LINK</a> > {code} > but wrongfully follows: > {code:html} > <a href="..." rel="external nofollow">DO NOT FOLLOW THIS > LINK</a> > {code} > Because of the code duplication in nutch's html parsers, this should be fixed > in two places: > # > [parse/html/DOMContentUtils.java|https://github.com/apache/nutch/blob/3ada351a26b653b307c19e25b17e0e611a9bd59a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java#L437] > # > [parse/tika/DOMContentUtils.java|https://github.com/apache/nutch/blob/f02110f42c53e77450835776cf41f22c23f030ec/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java#L410] -- This message was sent by Atlassian Jira (v8.20.10#820010)
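The fix the NUTCH-2634 report asks for amounts to treating rel as a whitespace-separated list of link types and looking for the "nofollow" token in it, rather than comparing the whole attribute value to the exact string "nofollow". A minimal sketch of such a check, under the assumption that this is the intended semantics; this is not the actual DOMContentUtils code:

```java
public class RelNofollowSketch {
    // Return true if the rel attribute contains "nofollow" among its
    // whitespace-separated link-type tokens (case-insensitive),
    // instead of requiring the attribute to equal "nofollow" exactly.
    static boolean hasNofollow(String rel) {
        if (rel == null) {
            return false;
        }
        for (String token : rel.trim().split("\\s+")) {
            if (token.equalsIgnoreCase("nofollow")) {
                return true;
            }
        }
        return false;
    }
}
```

With this check, both rel="nofollow" and a multi-token value such as rel="external nofollow" cause the link to be skipped, which matches the behaviour the reporter expects.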
[jira] [Comment Edited] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17651243#comment-17651243 ] Markus Jelsma edited comment on NUTCH-2978 at 12/22/22 11:33 AM: - Ah nope, this is not it. Parse-tika throws lots of errors and stack traces, although it does work. We now get: java.util.ServiceConfigurationError: org.apache.logging.log4j.spi.Provider: org.apache.logging.log4j.core.impl.Log4jProvider not a subtype There are no multiple versions of the same logging JARs anywhere on the classpath. was (Author: markus17): Ah nope, this is not it. Parse-tika throws lots of errors and stack traces, although it does work. We now get: java.util.ServiceConfigurationError: org.apache.logging.log4j.spi.Provider: org.apache.logging.log4j.core.impl.Log4jProvider not a subtype > Move to slf4j2 and remove log4j1 and reload4j > - > > Key: NUTCH-2978 > URL: https://issues.apache.org/jira/browse/NUTCH-2978 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Priority: Major > Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, > NUTCH-2978-3.patch, NUTCH-2978-any23.patch, NUTCH-2978.patch > > > I got in trouble upgrading some dependencies and got a lot of LinkageErrors > today, or with a Tika upgrade, disappearing logs. This patch fixes that by > moving to slf4j2, using the correct log4j2-slfj4-impl2 and getting rid of old > log4j -> reload4j. > > This patch fixes it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17651243#comment-17651243 ] Markus Jelsma commented on NUTCH-2978: -- Ah nope, this is not it. Parse-tika throws lots of errors and stack traces, although it does work. We now get: java.util.ServiceConfigurationError: org.apache.logging.log4j.spi.Provider: org.apache.logging.log4j.core.impl.Log4jProvider not a subtype > Move to slf4j2 and remove log4j1 and reload4j > - > > Key: NUTCH-2978 > URL: https://issues.apache.org/jira/browse/NUTCH-2978 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Priority: Major > Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, > NUTCH-2978-3.patch, NUTCH-2978-any23.patch, NUTCH-2978.patch > > > I got in trouble upgrading some dependencies and got a lot of LinkageErrors > today, or with a Tika upgrade, disappearing logs. This patch fixes that by > moving to slf4j2, using the correct log4j2-slfj4-impl2 and getting rid of old > log4j -> reload4j. > > This patch fixes it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17648636#comment-17648636 ] Markus Jelsma commented on NUTCH-2978: -- New patch now makes sure there is a log4j 2.19 in tika and mentioned in its plugin.xml, otherwise above will happen. Now i am not sure the other plugins are still ok. > Move to slf4j2 and remove log4j1 and reload4j > - > > Key: NUTCH-2978 > URL: https://issues.apache.org/jira/browse/NUTCH-2978 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Priority: Major > Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, > NUTCH-2978-3.patch, NUTCH-2978-any23.patch, NUTCH-2978.patch > > > I got in trouble upgrading some dependencies and got a lot of LinkageErrors > today, or with a Tika upgrade, disappearing logs. This patch fixes that by > moving to slf4j2, using the corrent log4j2-slfj4-impl2 and getting rid of old > log4j -> reload4j. > > This patch fixes it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2978: - Attachment: NUTCH-2978-3.patch > Move to slf4j2 and remove log4j1 and reload4j > - > > Key: NUTCH-2978 > URL: https://issues.apache.org/jira/browse/NUTCH-2978 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Priority: Major > Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, > NUTCH-2978-3.patch, NUTCH-2978-any23.patch, NUTCH-2978.patch > > > I got in trouble upgrading some dependencies and got a lot of LinkageErrors > today, or with a Tika upgrade, disappearing logs. This patch fixes that by > moving to slf4j2, using the corrent log4j2-slfj4-impl2 and getting rid of old > log4j -> reload4j. > > This patch fixes it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17648633#comment-17648633 ] Markus Jelsma commented on NUTCH-2978: -- Ok, i also wanted to get rid of loose log4j libs. There was still one in any23 and parse-tika. When removing the lib from parse-tika, lots of bad things happen.
{code:java}
22/12/16 13:36:03 WARN ooxml.OPCPackageDetector: Unable to load org.apache.tika.detect.microsoft.ooxml.OPCPackageDetector
java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
    at org.apache.poi.ooxml.POIXMLRelation.<clinit>(POIXMLRelation.java:54)
    at org.apache.tika.detect.microsoft.ooxml.OPCPackageDetector.<init>(OPCPackageDetector.java:106)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at java.base/java.lang.Class.newInstance(Class.java:584)
    at org.apache.tika.utils.ServiceLoaderUtils.newInstance(ServiceLoaderUtils.java:80)
    at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:345)
    at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:312)
    at org.apache.tika.detect.zip.DefaultZipContainerDetector.<init>(DefaultZipContainerDetector.java:85)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at org.apache.tika.utils.ServiceLoaderUtils.newInstance(ServiceLoaderUtils.java:78)
    at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:345)
    at org.apache.tika.detect.DefaultDetector.getDefaultDetectors(DefaultDetector.java:90)
    at org.apache.tika.detect.DefaultDetector.<init>(DefaultDetector.java:50)
    at org.apache.tika.detect.DefaultDetector.<init>(DefaultDetector.java:55)
    at org.apache.tika.config.TikaConfig.getDefaultDetector(TikaConfig.java:264)
    at org.apache.tika.config.TikaConfig$DetectorXmlLoader.createDefault(TikaConfig.java:1017)
    at org.apache.tika.config.TikaConfig$DetectorXmlLoader.createDefault(TikaConfig.java:975)
    at org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:630)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:155)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:145)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:120)
    at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:276)
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:177)
    at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:136)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:75)
    at org.apache.nutch.indexer.IndexingFiltersChecker.process(IndexingFiltersChecker.java:245)
    at org.apache.nutch.util.AbstractChecker.processSingle(AbstractChecker.java:87)
    at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:136)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81)
    at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:316)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
Caused by: java.lang.ClassNotFoundException: org.apache.logging.log4j.LogManager
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
    at org.apache.nutch.plugin.PluginClassLoader.loadClassFromSystem(PluginClassLoader.java:105)
    at org.apache.nutch.plugin.PluginClassLoader.loadClassFromParent(PluginClassLoader.java:93)
{code}
[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17648625#comment-17648625 ] Markus Jelsma commented on NUTCH-2978: -- Patch now includes Sebastian's patch, and actually contains the upgrade from old slf4j to the new 2.0.6. Tested on Hadoop 3.3.4 cluster with a parsing fetcher. This went just fine. -I must admit that those slf4js and jcl-over-slf remaining in the plugins do bother me to some degree.- New patch now includes exclusions to get rid of all of them. > Move to slf4j2 and remove log4j1 and reload4j > - > > Key: NUTCH-2978 > URL: https://issues.apache.org/jira/browse/NUTCH-2978 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Priority: Major > Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, > NUTCH-2978-any23.patch, NUTCH-2978.patch > > > I got in trouble upgrading some dependencies and got a lot of LinkageErrors > today, or with a Tika upgrade, disappearing logs. This patch fixes that by > moving to slf4j2, using the correct log4j2-slfj4-impl2 and getting rid of old > log4j -> reload4j. > > This patch fixes it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2978: - Attachment: NUTCH-2978-2.patch > Move to slf4j2 and remove log4j1 and reload4j > - > > Key: NUTCH-2978 > URL: https://issues.apache.org/jira/browse/NUTCH-2978 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Priority: Major > Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, > NUTCH-2978-any23.patch, NUTCH-2978.patch > > > I got in trouble upgrading some dependencies and got a lot of LinkageErrors > today, or with a Tika upgrade, disappearing logs. This patch fixes that by > moving to slf4j2, using the corrent log4j2-slfj4-impl2 and getting rid of old > log4j -> reload4j. > > This patch fixes it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2978: - Attachment: (was: NUTCH-2978-1.patch) > Move to slf4j2 and remove log4j1 and reload4j > - > > Key: NUTCH-2978 > URL: https://issues.apache.org/jira/browse/NUTCH-2978 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Priority: Major > Attachments: NUTCH-2978-1.patch, NUTCH-2978-any23.patch, > NUTCH-2978.patch > > > I got in trouble upgrading some dependencies and got a lot of LinkageErrors > today, or with a Tika upgrade, disappearing logs. This patch fixes that by > moving to slf4j2, using the corrent log4j2-slfj4-impl2 and getting rid of old > log4j -> reload4j. > > This patch fixes it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2978: - Attachment: NUTCH-2978-1.patch > Move to slf4j2 and remove log4j1 and reload4j > - > > Key: NUTCH-2978 > URL: https://issues.apache.org/jira/browse/NUTCH-2978 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Priority: Major > Attachments: NUTCH-2978-1.patch, NUTCH-2978-any23.patch, > NUTCH-2978.patch > > > I got in trouble upgrading some dependencies and got a lot of LinkageErrors > today, or with a Tika upgrade, disappearing logs. This patch fixes that by > moving to slf4j2, using the corrent log4j2-slfj4-impl2 and getting rid of old > log4j -> reload4j. > > This patch fixes it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2978: - Attachment: NUTCH-2978-1.patch > Move to slf4j2 and remove log4j1 and reload4j > - > > Key: NUTCH-2978 > URL: https://issues.apache.org/jira/browse/NUTCH-2978 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Priority: Major > Attachments: NUTCH-2978-1.patch, NUTCH-2978-any23.patch, > NUTCH-2978.patch > > > I got in trouble upgrading some dependencies and got a lot of LinkageErrors > today, or with a Tika upgrade, disappearing logs. This patch fixes that by > moving to slf4j2, using the corrent log4j2-slfj4-impl2 and getting rid of old > log4j -> reload4j. > > This patch fixes it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17648060#comment-17648060 ] Markus Jelsma commented on NUTCH-2978: -- Ah yes, thanks! I am not sure if a 'solution' will come from Tika, that specific package seems to be shaded in all versions between 2.3.0 and 2.6.0. But, we, ASF Nutch, do not depend on it so we are good. Patched like this, Nutch will fetch/parse just fine when running on Hadoop. I did get this when doing an indexchecker using the job file: ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console... However, logging worked just fine. > Move to slf4j2 and remove log4j1 and reload4j > - > > Key: NUTCH-2978 > URL: https://issues.apache.org/jira/browse/NUTCH-2978 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Priority: Major > Attachments: NUTCH-2978-any23.patch, NUTCH-2978.patch > > > I got in trouble upgrading some dependencies and got a lot of LinkageErrors > today, or with a Tika upgrade, disappearing logs. This patch fixes that by > moving to slf4j2, using the correct log4j2-slfj4-impl2 and getting rid of old > log4j -> reload4j. > > This patch fixes it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646681#comment-17646681 ] Markus Jelsma commented on NUTCH-2978: -- About the slf issues: somewhere another slf4j jar was lurking in the job file, but i couldn't find it for a long while. Until i saw there was a slf4j jar packaged within the tika-parser-scientific-package! I got rid of it, then got a xerces/xml-apis error, which i then also excluded. Now there are many other errors. Something to look out for when upgrading Tika. But for some reason, although we are using the same Tika version, that specific package does not appear as a dependency of Tika in Nutch's vanilla. That may change later. > Move to slf4j2 and remove log4j1 and reload4j > - > > Key: NUTCH-2978 > URL: https://issues.apache.org/jira/browse/NUTCH-2978 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Priority: Major > Attachments: NUTCH-2978-any23.patch, NUTCH-2978.patch > > > I got in trouble upgrading some dependencies and got a lot of LinkageErrors > today, or with a Tika upgrade, disappearing logs. This patch fixes that by > moving to slf4j2, using the correct log4j2-slfj4-impl2 and getting rid of old > log4j -> reload4j. > > This patch fixes it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-2924) Generate maxCount expr evaluated only once
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2924. -- Resolution: Fixed To https://gitbox.apache.org/repos/asf/nutch.git d806aa450..7d3900450 master -> master > Generate maxCount expr evaluated only once > -- > > Key: NUTCH-2924 > URL: https://issues.apache.org/jira/browse/NUTCH-2924 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 1.16 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Major > Fix For: 1.20 > > Attachments: NUTCH-2924-1.patch, NUTCH-2924-2.patch, > NUTCH-2924-3.patch, NUTCH-2924-4.patch, NUTCH-2924-5.patch, NUTCH-2924.patch > > > The generate.maxCount expression is evaluated only once in the generator's > reducer, instead, it must be set once per host. -- This message was sent by Atlassian Jira (v8.20.10#820010)
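The NUTCH-2924 description above boils down to re-evaluating the generate.maxCount expression whenever the reducer moves on to a new host, rather than caching one value for the whole reduce task. A minimal sketch of that per-key re-evaluation pattern; the evaluate function and all names here are hypothetical stand-ins, not the actual Generator code, which evaluates a configured expression:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerHostMaxCountSketch {
    // Hypothetical stand-in for evaluating the generate.maxCount
    // expression for a given host.
    static int evaluateMaxCount(String host) {
        return host.endsWith(".example.com") ? 10 : 100;
    }

    // Re-evaluate whenever the grouping key (host) changes, instead of
    // computing a single maxCount once for the entire reduce task.
    static Map<String, Integer> assignLimits(List<String> sortedHosts) {
        Map<String, Integer> limits = new LinkedHashMap<>();
        String current = null;
        int maxCount = -1;
        for (String host : sortedHosts) {
            if (!host.equals(current)) { // new host: evaluate again
                current = host;
                maxCount = evaluateMaxCount(host);
            }
            limits.put(host, maxCount);
        }
        return limits;
    }
}
```

The bug described in the ticket corresponds to hoisting the evaluateMaxCount call out of the loop entirely, so every host silently inherits the limit computed for the first one.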
[jira] [Comment Edited] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644838#comment-17644838 ] Markus Jelsma edited comment on NUTCH-2978 at 12/8/22 2:34 PM: --- Ah, well. I also tried a Tika parsing fetcher of a vanilla 1.20 Nutch with just this patch, and the generator patch. It works! Not sure why our parser stuff fails, but at least Nutch' stuff is working! But we both use a LoggerFactory.getLogger invocation, the original TikaParser invocation works, mine doesn't. was (Author: markus17): Ah, well. I also tried a Tika parsing fetcher of a vanilla 1.20 Nutch with just this patch, and the generator patch. It works! Not sure why our parser stuff fails, but at least Nutch' stuff is working! > Move to slf4j2 and remove log4j1 and reload4j > - > > Key: NUTCH-2978 > URL: https://issues.apache.org/jira/browse/NUTCH-2978 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Priority: Major > Attachments: NUTCH-2978.patch > > > I got in trouble upgrading some dependencies and got a lot of LinkageErrors > today, or with a Tika upgrade, disappearing logs. This patch fixes that by > moving to slf4j2, using the corrent log4j2-slfj4-impl2 and getting rid of old > log4j -> reload4j. > > This patch fixes it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644838#comment-17644838 ] Markus Jelsma commented on NUTCH-2978: -- Ah, well. I also tried a Tika parsing fetcher of a vanilla 1.20 Nutch with just this patch, and the generator patch. It works! Not sure why our parser stuff fails, but at least Nutch' stuff is working! > Move to slf4j2 and remove log4j1 and reload4j > - > > Key: NUTCH-2978 > URL: https://issues.apache.org/jira/browse/NUTCH-2978 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Priority: Major > Attachments: NUTCH-2978.patch > > > I got in trouble upgrading some dependencies and got a lot of LinkageErrors > today, or with a Tika upgrade, disappearing logs. This patch fixes that by > moving to slf4j2, using the corrent log4j2-slfj4-impl2 and getting rid of old > log4j -> reload4j. > > This patch fixes it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644825#comment-17644825 ] Markus Jelsma edited comment on NUTCH-2978 at 12/8/22 2:12 PM: --- This morning i saw one of our internal projects spewing the same error as any23, it was quickly remedied by upgrading a dependency further down the line. Not sure if this will go as easy with the any23 plugin, i'll take a look. Regarding running on Hadoop, I just ran a patched 1.20 CrawldbReader job on a 3.3.4 cluster, it ran flawlessly! Encouraged by the result i quickly ran a generate, followed by a fetch. The fetch failed due to LinkageError in our parser plugin, similar as parse-tika. Too bad. A local indexchecker runs fine, an indexchecker using a job file fails with the same error. Removing all reload4j references is not solving it, as expected. Not sure what to do now. was (Author: markus17): This morning i saw one of our internal projects spewing the same error as any23, it was quickly remedied by upgrading a dependency further down the line. Not sure if this will go as easy with the any23 plugin, i'll take a look. Regarding running on Hadoop, I just ran a patched 1.20 CrawldbReader job on a 3.3.4 cluster, it ran flawlessly! Encouraged by the result i quickly ran a generate, followed by a fetch. The fetch failed due to LinkageError in our parser plugin, similar as parse-tika. Too bad. A local indexchecker runs fine, an indexchecker using a job file fails with the same error. > Move to slf4j2 and remove log4j1 and reload4j > - > > Key: NUTCH-2978 > URL: https://issues.apache.org/jira/browse/NUTCH-2978 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Priority: Major > Attachments: NUTCH-2978.patch > > > I got in trouble upgrading some dependencies and got a lot of LinkageErrors > today, or with a Tika upgrade, disappearing logs. This patch fixes that by > moving to slf4j2, using the correct log4j2-slfj4-impl2 and getting rid of old > log4j -> reload4j. > > This patch fixes it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644825#comment-17644825 ] Markus Jelsma commented on NUTCH-2978: -- This morning i saw one of our internal projects spewing the same error as any23, it was quickly remedied by upgrading a dependency further down the line. Not sure if this will go as easy with the any23 plugin, i'll take a look. Regarding running on Hadoop, I just ran a patched 1.20 CrawldbReader job on a 3.3.4 cluster, it ran flawlessly! Encouraged by the result i quickly ran a generate, followed by a fetch. The fetch failed due to LinkageError in our parser plugin, similar as parse-tika. Too bad. A local indexchecker runs fine, an indexchecker using a job file fails with the same error. > Move to slf4j2 and remove log4j1 and reload4j > - > > Key: NUTCH-2978 > URL: https://issues.apache.org/jira/browse/NUTCH-2978 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Priority: Major > Attachments: NUTCH-2978.patch > > > I got in trouble upgrading some dependencies and got a lot of LinkageErrors > today, or with a Tika upgrade, disappearing logs. This patch fixes that by > moving to slf4j2, using the correct log4j2-slfj4-impl2 and getting rid of old > log4j -> reload4j. > > This patch fixes it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2924) Generate maxCount expr evaluated only once
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644808#comment-17644808 ] Markus Jelsma commented on NUTCH-2924: -- Here's the proper patch, finally. > Generate maxCount expr evaluated only once > -- > > Key: NUTCH-2924 > URL: https://issues.apache.org/jira/browse/NUTCH-2924 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 1.16 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Major > Fix For: 1.20 > > Attachments: NUTCH-2924-1.patch, NUTCH-2924-2.patch, > NUTCH-2924-3.patch, NUTCH-2924-4.patch, NUTCH-2924-5.patch, NUTCH-2924.patch > > > The generate.maxCount expression is evaluated only once in the generator's > reducer; instead, it must be evaluated once per host. -- This message was sent by Atlassian Jira (v8.20.10#820010)
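The bug described above can be sketched in a few lines — a minimal illustration (not the actual patch) of why an expression over per-host statistics must be re-evaluated inside the per-host loop rather than once up front. The `evaluateMaxCount` helper and its 10%-of-known-URLs rule are hypothetical stand-ins for Nutch's expression support and for values that would come from the hostdb:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: generate.maxCount may be an expression over per-host statistics,
// so it must be re-evaluated for every host, not once before the loop.
public class MaxCountSketch {

  // Hypothetical stand-in for the expression evaluation: cap maxCount
  // at 10% of the host's known URL count, with a floor of 1.
  static long evaluateMaxCount(long hostUrlCount) {
    return Math.max(1, hostUrlCount / 10);
  }

  public static void main(String[] args) {
    Map<String, Long> hostdb = new LinkedHashMap<>();
    hostdb.put("example.org", 1000L);
    hostdb.put("example.com", 50L);

    // Buggy shape: evaluated once, using whichever host happens to come first.
    long fixedMaxCount = evaluateMaxCount(hostdb.values().iterator().next());

    for (Map.Entry<String, Long> host : hostdb.entrySet()) {
      // Correct shape: re-evaluate with this host's statistics.
      long perHost = evaluateMaxCount(host.getValue());
      // example.com gets perHost=5, but the buggy fixed value stays at 100.
      System.out.println(host.getKey() + " fixed=" + fixedMaxCount
          + " perHost=" + perHost);
    }
  }
}
```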
[jira] [Updated] (NUTCH-2924) Generate maxCount expr evaluated only once
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2924: - Attachment: NUTCH-2924-5.patch > Generate maxCount expr evaluated only once > -- > > Key: NUTCH-2924 > URL: https://issues.apache.org/jira/browse/NUTCH-2924 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 1.16 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Major > Fix For: 1.20 > > Attachments: NUTCH-2924-1.patch, NUTCH-2924-2.patch, > NUTCH-2924-3.patch, NUTCH-2924-4.patch, NUTCH-2924-5.patch, NUTCH-2924.patch > > > The generate.maxCount expression is evaluated only once in the generator's > reducer; instead, it must be evaluated once per host. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644491#comment-17644491 ] Markus Jelsma commented on NUTCH-2978: -- Yes, I saw the slf4j present in the plugin; it troubled me already when I attempted an upgrade to a newer Tika version. Regarding reload4j, I was already worried it might not run in distributed mode but haven't tested it yet. For now I am glad enough that Nutch runs our Tika-based parser in local mode. To be continued > Move to slf4j2 and remove log4j1 and reload4j > - > > Key: NUTCH-2978 > URL: https://issues.apache.org/jira/browse/NUTCH-2978 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Priority: Major > Attachments: NUTCH-2978.patch > > > I got in trouble upgrading some dependencies and got a lot of LinkageErrors > today, or with a Tika upgrade, disappearing logs. This patch fixes that by > moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old > log4j -> reload4j. > > This patch fixes it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2924) Generate maxCount expr evaluated only once
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644489#comment-17644489 ] Markus Jelsma commented on NUTCH-2924: -- Yes, that is expected. This patch requires a hostdb to be configured and present; I will add a check for that. > Generate maxCount expr evaluated only once > -- > > Key: NUTCH-2924 > URL: https://issues.apache.org/jira/browse/NUTCH-2924 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 1.16 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Major > Fix For: 1.20 > > Attachments: NUTCH-2924-1.patch, NUTCH-2924-2.patch, > NUTCH-2924-3.patch, NUTCH-2924-4.patch, NUTCH-2924.patch > > > The generate.maxCount expression is evaluated only once in the generator's > reducer; instead, it must be evaluated once per host. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-2977) Support for showing dependency tree
[ https://issues.apache.org/jira/browse/NUTCH-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2977. -- Fix Version/s: 1.20 Resolution: Fixed > Support for showing dependency tree > --- > > Key: NUTCH-2977 > URL: https://issues.apache.org/jira/browse/NUTCH-2977 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Major > Fix For: 1.20 > > Attachments: NUTCH-2977.patch > > > I am upgrading Nutch to slf4j 2 and need to get rid of old 1.7 stuff, and > especially reload4j. I desperately need this function for that. > > $ ant dependencytree > > will now show the tree. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2977) Support for showing dependency tree
[ https://issues.apache.org/jira/browse/NUTCH-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644436#comment-17644436 ] Markus Jelsma commented on NUTCH-2977: -- {color:#00}Committed:{color} ed7b6615b..d806aa450 master -> master > Support for showing dependency tree > --- > > Key: NUTCH-2977 > URL: https://issues.apache.org/jira/browse/NUTCH-2977 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Major > Attachments: NUTCH-2977.patch > > > I am upgrading Nutch to slf4j 2 and need to get rid of old 1.7 stuff, and > especially reload4j. I desperately need this function for that. > > $ ant dependencytree > > will now show the tree. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2978: - Description: I got in trouble upgrading some dependencies and got a lot of LinkageErrors today, or with a Tika upgrade, disappearing logs. This patch fixes that by moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old log4j -> reload4j. This patch fixes it. was: I got in trouble upgrading some dependencies and got a lot of LinkageErrors today, or with a Tika upgrade, disappearing logs. This patch fixes that by moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old log4j -> reload4j. > Move to slf4j2 and remove log4j1 and reload4j > - > > Key: NUTCH-2978 > URL: https://issues.apache.org/jira/browse/NUTCH-2978 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Priority: Major > Attachments: NUTCH-2978.patch > > > I got in trouble upgrading some dependencies and got a lot of LinkageErrors > today, or with a Tika upgrade, disappearing logs. This patch fixes that by > moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old > log4j -> reload4j. > > This patch fixes it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2978: - Description: I got in trouble upgrading some dependencies and got a lot of LinkageErrors today, or with a Tika upgrade, disappearing logs. This patch fixes that by moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old log4j -> reload4j. > Move to slf4j2 and remove log4j1 and reload4j > - > > Key: NUTCH-2978 > URL: https://issues.apache.org/jira/browse/NUTCH-2978 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Priority: Major > Attachments: NUTCH-2978.patch > > > I got in trouble upgrading some dependencies and got a lot of LinkageErrors > today, or with a Tika upgrade, disappearing logs. This patch fixes that by > moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old > log4j -> reload4j. > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2978: - Attachment: NUTCH-2978.patch > Move to slf4j2 and remove log4j1 and reload4j > - > > Key: NUTCH-2978 > URL: https://issues.apache.org/jira/browse/NUTCH-2978 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Priority: Major > Attachments: NUTCH-2978.patch > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j
Markus Jelsma created NUTCH-2978: Summary: Move to slf4j2 and remove log4j1 and reload4j Key: NUTCH-2978 URL: https://issues.apache.org/jira/browse/NUTCH-2978 Project: Nutch Issue Type: Task Reporter: Markus Jelsma Attachments: NUTCH-2978.patch -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2977) Support for showing dependency tree
[ https://issues.apache.org/jira/browse/NUTCH-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2977: - Attachment: NUTCH-2977.patch > Support for showing dependency tree > --- > > Key: NUTCH-2977 > URL: https://issues.apache.org/jira/browse/NUTCH-2977 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Major > Attachments: NUTCH-2977.patch > > > I am upgrading Nutch to slf4j 2 and need to get rid of old 1.7 stuff, and > especially reload4j. I desperately need this function for that. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2977) Support for showing dependency tree
[ https://issues.apache.org/jira/browse/NUTCH-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2977: - Description: I am upgrading Nutch to slf4j 2 and need to get rid of old 1.7 stuff, and especially reload4j. I desperately need this function for that. $ ant dependencytree will now show the tree. was:I am upgrading Nutch to slf4j 2 and need to get rid of old 1.7 stuff, and especially reload4j. I desperately need this function for that. > Support for showing dependency tree > --- > > Key: NUTCH-2977 > URL: https://issues.apache.org/jira/browse/NUTCH-2977 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Major > Attachments: NUTCH-2977.patch > > > I am upgrading Nutch to slf4j 2 and need to get rid of old 1.7 stuff, and > especially reload4j. I desperately need this function for that. > > $ ant dependencytree > > will now show the tree. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-2977) Support for showing dependency tree
Markus Jelsma created NUTCH-2977: Summary: Support for showing dependency tree Key: NUTCH-2977 URL: https://issues.apache.org/jira/browse/NUTCH-2977 Project: Nutch Issue Type: Task Reporter: Markus Jelsma Assignee: Markus Jelsma I am upgrading Nutch to slf4j 2 and need to get rid of old 1.7 stuff, and especially reload4j. I desperately need this function for that. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2924) Generate maxCount expr evaluated only once
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644298#comment-17644298 ] Markus Jelsma commented on NUTCH-2924: -- Updated patch for master. > Generate maxCount expr evaluated only once > -- > > Key: NUTCH-2924 > URL: https://issues.apache.org/jira/browse/NUTCH-2924 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 1.16 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Major > Fix For: 1.20 > > Attachments: NUTCH-2924-1.patch, NUTCH-2924-2.patch, > NUTCH-2924-3.patch, NUTCH-2924-4.patch, NUTCH-2924.patch > > > The generate.maxCount expression is evaluated only once in the generator's > reducer; instead, it must be evaluated once per host. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-2924) Generate maxCount expr evaluated only once
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2924: - Attachment: NUTCH-2924-4.patch > Generate maxCount expr evaluated only once > -- > > Key: NUTCH-2924 > URL: https://issues.apache.org/jira/browse/NUTCH-2924 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 1.16 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Major > Fix For: 1.20 > > Attachments: NUTCH-2924-1.patch, NUTCH-2924-2.patch, > NUTCH-2924-3.patch, NUTCH-2924-4.patch, NUTCH-2924.patch > > > The generate.maxCount expression is evaluated only once in the generator's > reducer; instead, it must be evaluated once per host. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2973) Single domain names (eg https://localnet) can't be crawled - filtering fails
[ https://issues.apache.org/jira/browse/NUTCH-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621023#comment-17621023 ] Markus Jelsma commented on NUTCH-2973: -- Hello David, By default urlfilter-validator is an active plugin, and it will reject anything like localhost or localnet. You can disable the plugin in your plugin.includes configuration directive. > Single domain names (eg https://localnet) can't be crawled - filtering fails > > > Key: NUTCH-2973 > URL: https://issues.apache.org/jira/browse/NUTCH-2973 > Project: Nutch > Issue Type: Bug > Components: fetcher >Affects Versions: 1.19 > Environment: Nutch 1.19, checked on Windows 10 and Ubuntu. Both have > the same issue. > I'm trying to crawl a SharePoint intranet using nutch where the URLs are > similar to: > > {{https://localnet/something.aspx}} > The issue is that Nutch is rejecting any url with a single element domain > name such as localnet above. "localnet.com" is not rejected, nor is > "local.localnet". It almost feels as if there's a chunk of code within Nutch > that's unrelated to the filtering mechanisms that rejects URLs outright if > they don't have a WWW style format and a WWW-style domain such as .COM > Error message: > > {{Total urls rejected by filters: 1}} > I've checked and updated all the _filter_ files in the conf directory. Even > making them incredibly permissive (effectively "crawl everything") has not > helped. >Reporter: David Smith >Priority: Blocker > > There appears to be a bug within the core of Nutch that fails to permit any > single domain name URLs to be crawled. Example: > {{https://{*}localnet{*}/something.aspx}} > The issue is that Nutch is rejecting any url with a single element domain > name such as *localnet* above. "localnet.com" is not rejected, nor is > "local.localnet".
It almost feels as if there's a chunk of code within Nutch > that's unrelated to the filtering mechanisms that rejects URLs outright if > they don't have a WWW style format and a WWW-style domain such as .COM > Error message: > {{Total urls rejected by filters: 1}} > I've checked and updated all the filter files in the conf directory. Even > making them incredibly permissive (effectively "crawl everything") has not > helped. As soon as a dot (.) is added to the domain name it is no longer > rejected - eg blah.localnet. -- This message was sent by Atlassian Jira (v8.20.10#820010)
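The workaround Markus describes — dropping urlfilter-validator from plugin.includes in nutch-site.xml — could look roughly like this. The plugin list below is illustrative only (keep whatever other plugins your crawl actually needs); the point is simply that urlfilter-validator is absent:

```xml
<!-- nutch-site.xml sketch: a plugin.includes value without urlfilter-validator,
     so single-label hosts such as https://localnet are no longer rejected.
     The concrete plugin list here is an example, not a recommendation. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```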
[jira] [Commented] (NUTCH-2969) Javadoc: Javascript search is not working when built on JDK 11
[ https://issues.apache.org/jira/browse/NUTCH-2969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582978#comment-17582978 ] Markus Jelsma commented on NUTCH-2969: -- Nice! > Javadoc: Javascript search is not working when built on JDK 11 > -- > > Key: NUTCH-2969 > URL: https://issues.apache.org/jira/browse/NUTCH-2969 > Project: Nutch > Issue Type: Bug > Components: build, documentation >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.19 > > > Newer Java versions create a nice Javascript-backed search box on the > Javadocs. When built on JDK 11 clicking on search results leads to 404 > results with a {{/undefined/}} path element in the URL. This is caused by > [JDK-8215291|https://bugs.openjdk.org/browse/JDK-8215291] - Nutch does not > use Java modules. > The issue can be solved by adding {{--no-module-directories}} as argument to > the {{javadoc}} command. However, this only works for JDK 11 and will break > the javadoc build for later JDK versions, see > [JDK-8215582|https://bugs.openjdk.org/browse/JDK-8215582]. > See also > - > https://stackoverflow.com/questions/52326318/maven-javadoc-search-redirects-to-undefined-url > - https://github.com/crawler-commons/crawler-commons/pull/380 > Notes: > - seen while preparing a release candidate for 1.19 (will quickly commit the > solution) > - adding {{--no-module-directories}} should be conditional for JDK 11 only > -* ant: see [condition|https://ant.apache.org/manual/Tasks/condition.html] > -* gradle: > {noformat} > if (JavaVersion.current().isJava11()) { > options.addBooleanOption("-no-module-directories", true) > } > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
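For the Ant side mentioned above (making {{--no-module-directories}} conditional on JDK 11), a sketch of the build.xml change could look like the following — the target layout and property name are assumptions, not the committed fix:

```xml
<!-- build.xml sketch: pass --no-module-directories to javadoc on JDK 11 only,
     since the flag breaks the javadoc build on later JDKs (JDK-8215582). -->
<condition property="javadoc.nomodule.arg"
           value="--no-module-directories" else="">
  <equals arg1="${ant.java.version}" arg2="11"/>
</condition>
<javadoc destdir="${build.javadoc}" sourcepath="src/java">
  <arg line="${javadoc.nomodule.arg}"/>
</javadoc>
```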
[jira] [Commented] (NUTCH-2960) indexer-elastic: remove plugin from binary package to address licensing issues
[ https://issues.apache.org/jira/browse/NUTCH-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582859#comment-17582859 ] Markus Jelsma commented on NUTCH-2960: -- Yes, this would be much preferred over removing the binaries from the distribution! > indexer-elastic: remove plugin from binary package to address licensing issues > -- > > Key: NUTCH-2960 > URL: https://issues.apache.org/jira/browse/NUTCH-2960 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > The license of Elasticsearch has changed with v7.11.0 and upwards and is (if > correct) not more compatible with the Apache license. Accordingly, we should > not further ship Elastic jars with the binary package. > It should be possible to keep the indexer-elastic plugin in the source > package as an [optional|https://www.apache.org/legal/resolved.html#optional] > dependency (indexer-solr is the default indexing backend and more are > available). -- This message was sent by Atlassian Jira (v8.20.10#820010)