[jira] [Updated] (NUTCH-3056) Injector to support resolving seed URLs

2024-05-16 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3056: - Description: We have a case where clients submit huge uncurated seed files, the host may

[jira] [Updated] (NUTCH-3056) Injector to support resolving seed URLs

2024-05-16 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3056: - Description: We have a case where clients submit huge uncurated seed files, the host may

[jira] [Created] (NUTCH-3056) Injector to support resolving seed URLs

2024-05-16 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-3056: Summary: Injector to support resolving seed URLs Key: NUTCH-3056 URL: https://issues.apache.org/jira/browse/NUTCH-3056 Project: Nutch Issue Type

[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-30 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842384#comment-17842384 ] Markus Jelsma commented on NUTCH-3028: -- Ok, the Content object is now also available

[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-30 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3028: - Attachment: NUTCH-3028-2.patch > WARCExported to support filtering by J

[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-30 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3028: - Description: Filtering segment data to WARC is now possible using JEXL expressions. In the next

[jira] [Commented] (NUTCH-3039) Failure to handle ftp:// URLs

2024-04-11 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836133#comment-17836133 ] Markus Jelsma commented on NUTCH-3039: -- Thanks for spotting that! > Failure to handle ftp:// U

[jira] [Commented] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-14 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827048#comment-17827048 ] Markus Jelsma commented on NUTCH-3029: -- comment describing throws is also required these days

[jira] [Commented] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-13 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826823#comment-17826823 ] Markus Jelsma commented on NUTCH-3029: -- throws was missing too    84cda2abd..a8ec17ca8  master

[jira] [Commented] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-13 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826783#comment-17826783 ] Markus Jelsma commented on NUTCH-3029: -- Thanks Lewis!    5ba50c0c6..84cda2abd  master -> mas

[jira] [Commented] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-13 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826759#comment-17826759 ] Markus Jelsma commented on NUTCH-3029: --    4f62dec0f..5ba50c0c6  master -> master actual cha

[jira] [Commented] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-13 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826760#comment-17826760 ] Markus Jelsma commented on NUTCH-3033: -- Ah, the new ivy works like a charm! Thanks! > Upgrade

[jira] [Resolved] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-13 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-3029. -- Resolution: Fixed Thanks Martin!    551c50b1c..4642c30c2  master -> master > Host sp

[jira] [Resolved] (NUTCH-3030) Use system default cipher suites instead of hard-coded set

2024-03-13 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-3030. -- Resolution: Fixed 42b55f6a9..551c50b1c  master -> master   Thanks Martin!   > Use

[jira] [Updated] (NUTCH-3030) Use system default cipher suites instead of hard-coded set

2024-03-13 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3030: - Summary: Use system default cipher suites instead of hard-coded set (was: Update default TLS

[jira] [Commented] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-12 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825863#comment-17825863 ] Markus Jelsma commented on NUTCH-3032: -- No idea what git fork is supposed to do, maybe it should

[jira] [Resolved] (NUTCH-3031) ProtocolFactory host mapper to support domains

2024-03-12 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-3031. -- Resolution: Fixed    83acd501e..c390dfc8b  master -> master > ProtocolFactory host

Re: [DISCUSS] Release Nutch 1.20

2024-03-10 Thread Markus Jelsma
Good idea! I'll finish work on three open issues next week. On Sat, 9 Mar 2024 at 13:02, Sebastian Nagel < wastl.na...@googlemail.com> wrote: > Hi Lewis, > > yes, of course! > > Some points we should do before the release: > > - address the ES licensing issue, >the easiest way is to

[jira] [Updated] (NUTCH-3031) ProtocolFactory host mapper to support domains

2024-03-05 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3031: - Attachment: NUTCH-3031.patch > ProtocolFactory host mapper to support doma

[jira] [Created] (NUTCH-3031) ProtocolFactory host mapper to support domains

2024-03-05 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-3031: Summary: ProtocolFactory host mapper to support domains Key: NUTCH-3031 URL: https://issues.apache.org/jira/browse/NUTCH-3031 Project: Nutch Issue Type

[jira] [Commented] (NUTCH-3030) Update default TLS cipher suites for http(s) protocol

2024-02-19 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818531#comment-17818531 ] Markus Jelsma commented on NUTCH-3030: -- For some reason the attached patch did not apply cleanly

[jira] [Updated] (NUTCH-3030) Update default TLS cipher suites for http(s) protocol

2024-02-19 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3030: - Attachment: NUTCH-3030.patch > Update default TLS cipher suites for http(s) proto

[jira] [Assigned] (NUTCH-3030) Update default TLS cipher suites for http(s) protocol

2024-02-19 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-3030: Assignee: Markus Jelsma > Update default TLS cipher suites for http(s) proto

[jira] [Assigned] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-02-19 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-3029: Assignee: Markus Jelsma > Host specific max. and min. intervals in adaptive schedu

Tika parsing error 1.19

2024-02-15 Thread Markus Jelsma
Hi, We were doing some tests with 1.19 and found that some sites became unparsable using Tika. At this moment I know of at least two sites causing this: my own, https://www.openindex.io/, and https://www.elzendaalcollege.nl/ 2024-02-15 12:33:49,639 WARN o.a.n.p.ParseUtil [main] Error parsing

[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-07 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815345#comment-17815345 ] Markus Jelsma commented on NUTCH-3028: -- New patch: when expression was not set, an exception

[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-07 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3028: - Attachment: NUTCH-3028-1.patch > WARCExported to support filtering by J

[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-06 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17814731#comment-17814731 ] Markus Jelsma commented on NUTCH-3028: -- Any objections to this one before i get

[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-01 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3028: - Description: Filtering segment data to WARC is now possible using JEXL expressions. In the next

[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-01 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3028: - Attachment: NUTCH-3027.patch > WARCExported to support filtering by J

[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-01 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3028: - Attachment: (was: NUTCH-3027.patch) > WARCExported to support filtering by J

[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-01 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3028: - Attachment: NUTCH-3028.patch > WARCExported to support filtering by J

[jira] [Created] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-01 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-3028: Summary: WARCExported to support filtering by JEXL Key: NUTCH-3028 URL: https://issues.apache.org/jira/browse/NUTCH-3028 Project: Nutch Issue Type

[jira] [Work started] (NUTCH-3027) Trivial resource leak patch in DomainSuffixes.java

2024-01-19 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3027 started by Markus Jelsma. > Trivial resource leak patch in DomainSuffixes.j

[jira] [Commented] (NUTCH-3027) Trivial resource leak patch in DomainSuffixes.java

2024-01-19 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808614#comment-17808614 ] Markus Jelsma commented on NUTCH-3027: -- Thanks Sascha Kehrli! Committed 85fea6e46

[jira] [Resolved] (NUTCH-3027) Trivial resource leak patch in DomainSuffixes.java

2024-01-19 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-3027. -- Fix Version/s: 1.20 Resolution: Fixed > Trivial resource leak pa

[jira] [Assigned] (NUTCH-3027) Trivial resource leak patch in DomainSuffixes.java

2024-01-19 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-3027: Assignee: Markus Jelsma > Trivial resource leak patch in DomainSuffixes.j

[jira] [Commented] (NUTCH-1635) New crawldb sometimes ends up in current

2023-10-02 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17771023#comment-17771023 ] Markus Jelsma commented on NUTCH-1635: -- Good point! No, we haven't seen this behaviour for the past

[jira] [Closed] (NUTCH-1635) New crawldb sometimes ends up in current

2023-10-02 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-1635. Resolution: Not A Problem > New crawldb sometimes ends up in curr

[jira] [Commented] (NUTCH-3007) Fix impossible casts

2023-09-28 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769989#comment-17769989 ] Markus Jelsma commented on NUTCH-3007: -- +1 yes! > Fix impossible ca

[jira] [Commented] (NUTCH-2852) Method invokes System.exit(...) 9 bugs

2023-09-28 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769988#comment-17769988 ] Markus Jelsma commented on NUTCH-2852: -- Seems just fine for these files +1 > Method invo

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2023-09-18 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766306#comment-17766306 ] Markus Jelsma commented on NUTCH-2978: -- Thanks for picking it up. I am very happy this one

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2023-09-13 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764699#comment-17764699 ] Markus Jelsma commented on NUTCH-2978: -- You managed to get it up and running, as well when deployed

[jira] [Commented] (NUTCH-3000) protocol-selenium returns only the body,strips off the element

2023-09-13 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764697#comment-17764697 ] Markus Jelsma commented on NUTCH-3000: -- Yes, this is a bit odd indeed. +1 > protocol-selen

[jira] [Commented] (NUTCH-2999) Update Lucene version to latest 8.x

2023-08-30 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760522#comment-17760522 ] Markus Jelsma commented on NUTCH-2999: -- Seems fine +1 > Update Lucene version to latest

[jira] [Commented] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-28 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738191#comment-17738191 ] Markus Jelsma commented on NUTCH-2993: -- To be honest, i am not too happy with the implementation

[jira] [Commented] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-28 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738047#comment-17738047 ] Markus Jelsma commented on NUTCH-2993: -- Thanks Sebastian! # changed the checks again. # check

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-28 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: NUTCH-2993-1.15-1.patch > ScoringDepth plugin to skip depth check based on

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-28 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: (was: NUTCH-2993.patch) > ScoringDepth plugin to skip depth check based on

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-28 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: NUTCH-2993.patch > ScoringDepth plugin to skip depth check based on URL Patt

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-28 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Description: We do not want some crawl to go deep and broad, but instead focus it on a narrow

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: NUTCH-2993-1.15.patch > ScoringDepth plugin to skip depth check based on URL Patt

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: (was: NUTCH-2993-1.15-1.patch) > ScoringDepth plugin to skip depth check ba

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: (was: NUTCH-2993-1.15.patch) > ScoringDepth plugin to skip depth check ba

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: NUTCH-2993-1.15-1.patch > ScoringDepth plugin to skip depth check based on

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Description: We do not want some crawl to go deep and broad, but instead focus it on a narrow

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Summary: ScoringDepth plugin to skip depth check based on URL Pattern (was: ScoringDepth plugin

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: NUTCH-2993-1.15.patch > ScoringDepth plugin to override maxDepth based on

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: (was: NUTCH-2993-1.15.patch) > ScoringDepth plugin to override maxDepth ba

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: (was: NUTCH-2993-1.15.patch) > ScoringDepth plugin to override maxDepth ba

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: NUTCH-2993-1.15.patch > ScoringDepth plugin to override maxDepth based on

[jira] [Commented] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730535#comment-17730535 ] Markus Jelsma commented on NUTCH-2993: -- Here's a simple patch against Nutch 1.15. Will patch

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: NUTCH-2993-1.15.patch > ScoringDepth plugin to override maxDepth based on

[jira] [Created] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-2993: Summary: ScoringDepth plugin to override maxDepth based on URL Pattern Key: NUTCH-2993 URL: https://issues.apache.org/jira/browse/NUTCH-2993 Project: Nutch

[jira] [Commented] (NUTCH-2985) Disable plugin urlfilter-validator by default

2023-02-24 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693291#comment-17693291 ] Markus Jelsma commented on NUTCH-2985: -- +1 > Disable plugin urlfilter-validator by defa

Re: Upgrading Selenium

2023-01-20 Thread Markus Jelsma
> There must be a way, some how, some time. There isn't: https://github.com/seleniumhq/selenium-google-code-issue-archive/issues/141 On Thu, 19 Jan 2023 at 15:23, Markus Jelsma < markus.jel...@openindex.io> wrote: > > This makes some sense if you do not know anything about the UR

Re: Upgrading Selenium

2023-01-19 Thread Markus Jelsma
and other stuff not suitable for Selenium. Could make the HEAD request >optional. > > > merging the lib-selenium plugin with the protocol-selenium plugin > > I guess lib-selenium is to share common components between > protocol-selenium and > protocol-interactiveselenium. May

Re: Upgrading Selenium

2023-01-17 Thread Markus Jelsma
Hello Kamil, Yes, the plugin needs some upgrading indeed. We use a modern version of it elsewhere and it works really well, at least better than HtmlUnit. Besides that, the plugin also needs some overhaul. It currently first downloads the URL with HttpClient, and then, depending on MIME-type, it

[jira] [Commented] (NUTCH-2974) Ant build fails with "Unparseable date" on certain platforms

2023-01-16 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677435#comment-17677435 ] Markus Jelsma commented on NUTCH-2974: -- Sounds like a nice solution for this obscure bug +1 >

[jira] [Commented] (NUTCH-2634) Some links marked as "nofollow" are followed anyway.

2023-01-06 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17655383#comment-17655383 ] Markus Jelsma commented on NUTCH-2634: -- +1 > Some links marked as "nofollow" are fo

[jira] [Comment Edited] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-22 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17651243#comment-17651243 ] Markus Jelsma edited comment on NUTCH-2978 at 12/22/22 11:33 AM: - Ah nope

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-22 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17651243#comment-17651243 ] Markus Jelsma commented on NUTCH-2978: -- Ah nope, this is not it. Parse-tika throws lots of errors

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17648636#comment-17648636 ] Markus Jelsma commented on NUTCH-2978: -- New patch now makes sure there is a log4j 2.19 in tika

[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2978: - Attachment: NUTCH-2978-3.patch > Move to slf4j2 and remove log4j1 and reloa

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17648633#comment-17648633 ] Markus Jelsma commented on NUTCH-2978: -- Ok, i also wanted to get rid of loose log4j libs

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17648625#comment-17648625 ] Markus Jelsma commented on NUTCH-2978: -- Patch now includes Sebastian's patch, and actually contains

[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2978: - Attachment: NUTCH-2978-2.patch > Move to slf4j2 and remove log4j1 and reloa

[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2978: - Attachment: (was: NUTCH-2978-1.patch) > Move to slf4j2 and remove log4j1 and reloa

[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2978: - Attachment: NUTCH-2978-1.patch > Move to slf4j2 and remove log4j1 and reloa

[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2978: - Attachment: NUTCH-2978-1.patch > Move to slf4j2 and remove log4j1 and reloa

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-15 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17648060#comment-17648060 ] Markus Jelsma commented on NUTCH-2978: -- Ah yes, thanks! I am not sure if a 'solution' will come from

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-13 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646681#comment-17646681 ] Markus Jelsma commented on NUTCH-2978: -- About the slf issues, Somewhere another slf4j jar

[jira] [Resolved] (NUTCH-2924) Generate maxCount expr evaluated only once

2022-12-12 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2924. -- Resolution: Fixed To https://gitbox.apache.org/repos/asf/nutch.git

[jira] [Comment Edited] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644838#comment-17644838 ] Markus Jelsma edited comment on NUTCH-2978 at 12/8/22 2:34 PM: --- Ah, well. I

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644838#comment-17644838 ] Markus Jelsma commented on NUTCH-2978: -- Ah, well. I also tried a Tika parsing fetcher of a vanilla

[jira] [Comment Edited] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644825#comment-17644825 ] Markus Jelsma edited comment on NUTCH-2978 at 12/8/22 2:12 PM

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644825#comment-17644825 ] Markus Jelsma commented on NUTCH-2978: -- This morning i saw one of our internal projects spewing

[jira] [Commented] (NUTCH-2924) Generate maxCount expr evaluated only once

2022-12-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644808#comment-17644808 ] Markus Jelsma commented on NUTCH-2924: -- Here's the proper patch, finally. > Generate maxCount e

[jira] [Updated] (NUTCH-2924) Generate maxCount expr evaluated only once

2022-12-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2924: - Attachment: NUTCH-2924-5.patch > Generate maxCount expr evaluated only o

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-07 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644491#comment-17644491 ] Markus Jelsma commented on NUTCH-2978: -- Yes, i saw the slf4j present in the plugin, it troubled my

[jira] [Commented] (NUTCH-2924) Generate maxCount expr evaluated only once

2022-12-07 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644489#comment-17644489 ] Markus Jelsma commented on NUTCH-2924: -- Yes, that is expected. This patch requires a hostdb

[jira] [Resolved] (NUTCH-2977) Support for showing dependency tree

2022-12-07 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2977. -- Fix Version/s: 1.20 Resolution: Fixed > Support for showing dependency t

[jira] [Commented] (NUTCH-2977) Support for showing dependency tree

2022-12-07 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644436#comment-17644436 ] Markus Jelsma commented on NUTCH-2977: -- Committed: ed7b6615b..d806aa450

[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-07 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2978: - Description: I got in trouble upgrading some dependencies and got a lot of LinkageErrors today

[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-07 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2978: - Description: I got in trouble upgrading some dependencies and got a lot of LinkageErrors today

[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-07 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2978: - Attachment: NUTCH-2978.patch > Move to slf4j2 and remove log4j1 and reloa

[jira] [Created] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-07 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-2978: Summary: Move to slf4j2 and remove log4j1 and reload4j Key: NUTCH-2978 URL: https://issues.apache.org/jira/browse/NUTCH-2978 Project: Nutch Issue Type: Task

[jira] [Updated] (NUTCH-2977) Support for showing dependency tree

2022-12-07 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2977: - Attachment: NUTCH-2977.patch > Support for showing dependency t

[jira] [Updated] (NUTCH-2977) Support for showing dependency tree

2022-12-07 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2977: - Description: I am upgrading Nutch to slf4j 2 and need to get rid of old 1.7 stuff

[jira] [Created] (NUTCH-2977) Support for showing dependency tree

2022-12-07 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-2977: Summary: Support for showing dependency tree Key: NUTCH-2977 URL: https://issues.apache.org/jira/browse/NUTCH-2977 Project: Nutch Issue Type: Task
