[jira] [Commented] (NUTCH-1806) Delegate processing of URL domains to crawler commons

2024-09-09 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880239#comment-17880239 ] Markus Jelsma commented on NUTCH-1806: -- Yes, this seems fine when glancing a

[jira] [Commented] (NUTCH-3056) Injector to support resolving seed URLs

2024-06-17 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855579#comment-17855579 ] Markus Jelsma commented on NUTCH-3056: -- Initial 1.15 patch. Set {color:#00

[jira] [Updated] (NUTCH-3056) Injector to support resolving seed URLs

2024-06-17 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3056: - Attachment: NUTCH-3056.patch > Injector to support resolving seed U

[jira] [Updated] (NUTCH-3056) Injector to support resolving seed URLs

2024-05-16 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3056: - Description: We have a case where clients submit huge uncurated seed files, the host may not

[jira] [Updated] (NUTCH-3056) Injector to support resolving seed URLs

2024-05-16 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3056: - Description: We have a case where clients submit huge uncurated seed files, the host may not

[jira] [Created] (NUTCH-3056) Injector to support resolving seed URLs

2024-05-16 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-3056: Summary: Injector to support resolving seed URLs Key: NUTCH-3056 URL: https://issues.apache.org/jira/browse/NUTCH-3056 Project: Nutch Issue Type

[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-30 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842384#comment-17842384 ] Markus Jelsma commented on NUTCH-3028: -- Ok, the Content object is now

[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-30 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3028: - Attachment: NUTCH-3028-2.patch > WARCExported to support filtering by J

[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-30 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3028: - Description: Filtering segment data to WARC is now possible using JEXL expressions. In the next

[jira] [Commented] (NUTCH-3039) Failure to handle ftp:// URLs

2024-04-11 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836133#comment-17836133 ] Markus Jelsma commented on NUTCH-3039: -- Thanks for spotting that! > Fai

[jira] [Commented] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-14 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17827048#comment-17827048 ] Markus Jelsma commented on NUTCH-3029: -- comment describing throws is also requ

[jira] [Commented] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-13 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17826823#comment-17826823 ] Markus Jelsma commented on NUTCH-3029: -- throws was missing too    84cda

[jira] [Commented] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-13 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17826783#comment-17826783 ] Markus Jelsma commented on NUTCH-3029: -- Thanks Lewis!    5ba50c0c6..84cda

[jira] [Commented] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-13 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17826759#comment-17826759 ] Markus Jelsma commented on NUTCH-3029: --    4f62dec0f..5ba50c0c6  master ->

[jira] [Commented] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-13 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17826760#comment-17826760 ] Markus Jelsma commented on NUTCH-3033: -- Ah, the new ivy works like a charm! Th

[jira] [Resolved] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-13 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-3029. -- Resolution: Fixed Thanks Martin!    551c50b1c..4642c30c2  master -> master > Host sp

[jira] [Resolved] (NUTCH-3030) Use system default cipher suites instead of hard-coded set

2024-03-13 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-3030. -- Resolution: Fixed 42b55f6a9..551c50b1c  master -> master   Thanks Martin!   > Use

[jira] [Updated] (NUTCH-3030) Use system default cipher suites instead of hard-coded set

2024-03-13 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3030: - Summary: Use system default cipher suites instead of hard-coded set (was: Update default TLS

[jira] [Commented] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-12 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825863#comment-17825863 ] Markus Jelsma commented on NUTCH-3032: -- No idea what git fork is supposed t

[jira] [Resolved] (NUTCH-3031) ProtocolFactory host mapper to support domains

2024-03-12 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-3031. -- Resolution: Fixed    83acd501e..c390dfc8b  master -> master > ProtocolFactory host map

Re: [DISCUSS] Release Nutch 1.20

2024-03-10 Thread Markus Jelsma
Good idea! I'll finish work on three open issues next week. On Sat, 9 Mar 2024 at 13:02, Sebastian Nagel < wastl.na...@googlemail.com> wrote: > Hi Lewis, > > yes, of course! > > Some points we should do before the release: > > - address the ES licensing issue, >the easiest way is to downgr

[jira] [Updated] (NUTCH-3031) ProtocolFactory host mapper to support domains

2024-03-05 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3031: - Attachment: NUTCH-3031.patch > ProtocolFactory host mapper to support doma

[jira] [Created] (NUTCH-3031) ProtocolFactory host mapper to support domains

2024-03-05 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-3031: Summary: ProtocolFactory host mapper to support domains Key: NUTCH-3031 URL: https://issues.apache.org/jira/browse/NUTCH-3031 Project: Nutch Issue Type

[jira] [Commented] (NUTCH-3030) Update default TLS cipher suites for http(s) protocol

2024-02-19 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17818531#comment-17818531 ] Markus Jelsma commented on NUTCH-3030: -- For some reason the attached patch did

[jira] [Updated] (NUTCH-3030) Update default TLS cipher suites for http(s) protocol

2024-02-19 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3030: - Attachment: NUTCH-3030.patch > Update default TLS cipher suites for http(s) proto

[jira] [Assigned] (NUTCH-3030) Update default TLS cipher suites for http(s) protocol

2024-02-19 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-3030: Assignee: Markus Jelsma > Update default TLS cipher suites for http(s) proto

[jira] [Assigned] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-02-19 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-3029: Assignee: Markus Jelsma > Host specific max. and min. intervals in adaptive schedu

Tika parsing error 1.19

2024-02-15 Thread Markus Jelsma
Hi, We were doing some tests with 1.19 and found that some sites became unparsable using Tika. At this moment i know of at least two sites causing this, my own, https://www.openindex.io/ and https://www.elzendaalcollege.nl/ 2024-02-15 12:33:49,639 WARN o.a.n.p.ParseUtil [main] Error parsing https

[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-07 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17815345#comment-17815345 ] Markus Jelsma commented on NUTCH-3028: -- New patch: when expression was not set

[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-07 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3028: - Attachment: NUTCH-3028-1.patch > WARCExported to support filtering by J

[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-06 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814731#comment-17814731 ] Markus Jelsma commented on NUTCH-3028: -- Any objections to this one before i ge

[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-01 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3028: - Description: Filtering segment data to WARC is now possible using JEXL expressions. In the next

[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-01 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3028: - Attachment: NUTCH-3027.patch > WARCExported to support filtering by J

[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-01 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3028: - Attachment: (was: NUTCH-3027.patch) > WARCExported to support filtering by J

[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-01 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-3028: - Attachment: NUTCH-3028.patch > WARCExported to support filtering by J

[jira] [Created] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-01 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-3028: Summary: WARCExported to support filtering by JEXL Key: NUTCH-3028 URL: https://issues.apache.org/jira/browse/NUTCH-3028 Project: Nutch Issue Type

[jira] [Work started] (NUTCH-3027) Trivial resource leak patch in DomainSuffixes.java

2024-01-19 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3027 started by Markus Jelsma. > Trivial resource leak patch in DomainSuffixes.j

[jira] [Commented] (NUTCH-3027) Trivial resource leak patch in DomainSuffixes.java

2024-01-19 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808614#comment-17808614 ] Markus Jelsma commented on NUTCH-3027: -- Thanks Sascha Kehrli! Committed  {c

[jira] [Resolved] (NUTCH-3027) Trivial resource leak patch in DomainSuffixes.java

2024-01-19 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-3027. -- Fix Version/s: 1.20 Resolution: Fixed > Trivial resource leak patch

[jira] [Assigned] (NUTCH-3027) Trivial resource leak patch in DomainSuffixes.java

2024-01-19 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-3027: Assignee: Markus Jelsma > Trivial resource leak patch in DomainSuffixes.j

[jira] [Commented] (NUTCH-1635) New crawldb sometimes ends up in current

2023-10-02 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17771023#comment-17771023 ] Markus Jelsma commented on NUTCH-1635: -- Good point! No, we haven't

[jira] [Closed] (NUTCH-1635) New crawldb sometimes ends up in current

2023-10-02 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-1635. Resolution: Not A Problem > New crawldb sometimes ends up in curr

[jira] [Commented] (NUTCH-3007) Fix impossible casts

2023-09-28 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769989#comment-17769989 ] Markus Jelsma commented on NUTCH-3007: -- +1 yes! > Fix impossibl

[jira] [Commented] (NUTCH-2852) Method invokes System.exit(...) 9 bugs

2023-09-28 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769988#comment-17769988 ] Markus Jelsma commented on NUTCH-2852: -- Seems just fine for these file

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2023-09-18 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766306#comment-17766306 ] Markus Jelsma commented on NUTCH-2978: -- Thanks for picking it up. I am very h

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2023-09-13 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764699#comment-17764699 ] Markus Jelsma commented on NUTCH-2978: -- You managed to get it up and running

[jira] [Commented] (NUTCH-3000) protocol-selenium returns only the body,strips off the element

2023-09-13 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764697#comment-17764697 ] Markus Jelsma commented on NUTCH-3000: -- Yes, this is a bit odd indeed

[jira] [Commented] (NUTCH-2999) Update Lucene version to latest 8.x

2023-08-30 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760522#comment-17760522 ] Markus Jelsma commented on NUTCH-2999: -- Seems fine +1 > Update Lucene ver

[jira] [Commented] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-28 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738191#comment-17738191 ] Markus Jelsma commented on NUTCH-2993: -- To be honest, i am not too happy with

[jira] [Commented] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-28 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738047#comment-17738047 ] Markus Jelsma commented on NUTCH-2993: -- Thanks Sebastian! # changed the ch

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-28 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: NUTCH-2993-1.15-1.patch > ScoringDepth plugin to skip depth check based on

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-28 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: (was: NUTCH-2993.patch) > ScoringDepth plugin to skip depth check based on

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-28 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: NUTCH-2993.patch > ScoringDepth plugin to skip depth check based on URL Patt

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-28 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Description: We do not want some crawl to go deep and broad, but instead focus it on a narrow

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: NUTCH-2993-1.15.patch > ScoringDepth plugin to skip depth check based on URL Patt

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: (was: NUTCH-2993-1.15-1.patch) > ScoringDepth plugin to skip depth check ba

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: (was: NUTCH-2993-1.15.patch) > ScoringDepth plugin to skip depth check based

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: NUTCH-2993-1.15-1.patch > ScoringDepth plugin to skip depth check based on

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Description: We do not want some crawl to go deep and broad, but instead focus it on a narrow

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Summary: ScoringDepth plugin to skip depth check based on URL Pattern (was: ScoringDepth plugin

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: NUTCH-2993-1.15.patch > ScoringDepth plugin to override maxDepth based on

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: (was: NUTCH-2993-1.15.patch) > ScoringDepth plugin to override maxDepth based

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: (was: NUTCH-2993-1.15.patch) > ScoringDepth plugin to override maxDepth based

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: NUTCH-2993-1.15.patch > ScoringDepth plugin to override maxDepth based on

[jira] [Commented] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730535#comment-17730535 ] Markus Jelsma commented on NUTCH-2993: -- Here's a simple patch against N

[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2993: - Attachment: NUTCH-2993-1.15.patch > ScoringDepth plugin to override maxDepth based on

[jira] [Created] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-2993: Summary: ScoringDepth plugin to override maxDepth based on URL Pattern Key: NUTCH-2993 URL: https://issues.apache.org/jira/browse/NUTCH-2993 Project: Nutch

[jira] [Commented] (NUTCH-2985) Disable plugin urlfilter-validator by default

2023-02-24 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693291#comment-17693291 ] Markus Jelsma commented on NUTCH-2985: -- +1 > Disable plugin urlfilter-valid

Re: Upgrading Selenium

2023-01-20 Thread Markus Jelsma
> There must be a way, some how, some time. There isn't: https://github.com/seleniumhq/selenium-google-code-issue-archive/issues/141 On Thu, 19 Jan 2023 at 15:23, Markus Jelsma < markus.jel...@openindex.io> wrote: > > This makes some sense if you do not know anything about t

Re: Upgrading Selenium

2023-01-19 Thread Markus Jelsma
les, >and other stuff not suitable for Selenium. Could make the HEAD request >optional. > > > merging the lib-selenium plugin with the protocol-selenium plugin > > I guess lib-selenium is to share common components between > protocol-selenium and > protocol-interactivesel

Re: Upgrading Selenium

2023-01-17 Thread Markus Jelsma
Hello Kamil, Yes, the plugin needs some upgrading indeed. We use a modern version of it elsewhere and it works really well, at least better than HtmlUnit. Besides that, the plugin also needs some overhaul. It currently first downloads the URL with HttpClient, and then, depending on MIME-type, it

[jira] [Commented] (NUTCH-2974) Ant build fails with "Unparseable date" on certain platforms

2023-01-16 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677435#comment-17677435 ] Markus Jelsma commented on NUTCH-2974: -- Sounds like a nice solution for

[jira] [Commented] (NUTCH-2634) Some links marked as "nofollow" are followed anyway.

2023-01-06 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655383#comment-17655383 ] Markus Jelsma commented on NUTCH-2634: -- +1 > Some links marked as "n

[jira] [Comment Edited] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-22 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17651243#comment-17651243 ] Markus Jelsma edited comment on NUTCH-2978 at 12/22/22 11:3

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-22 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17651243#comment-17651243 ] Markus Jelsma commented on NUTCH-2978: -- Ah nope, this is not it. Parse-tika th

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648636#comment-17648636 ] Markus Jelsma commented on NUTCH-2978: -- New patch now makes sure there is a l

[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2978: - Attachment: NUTCH-2978-3.patch > Move to slf4j2 and remove log4j1 and reloa

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648633#comment-17648633 ] Markus Jelsma commented on NUTCH-2978: -- Ok, i also wanted to get rid of loose l

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648625#comment-17648625 ] Markus Jelsma commented on NUTCH-2978: -- Patch now includes Sebastian's p

[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2978: - Attachment: NUTCH-2978-2.patch > Move to slf4j2 and remove log4j1 and reloa

[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2978: - Attachment: (was: NUTCH-2978-1.patch) > Move to slf4j2 and remove log4j1 and reloa

[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2978: - Attachment: NUTCH-2978-1.patch > Move to slf4j2 and remove log4j1 and reloa

[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2978: - Attachment: NUTCH-2978-1.patch > Move to slf4j2 and remove log4j1 and reloa

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-15 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648060#comment-17648060 ] Markus Jelsma commented on NUTCH-2978: -- Ah yes, thanks! I am not sure

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-13 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646681#comment-17646681 ] Markus Jelsma commented on NUTCH-2978: -- About the slf issues, Somewhere ano

[jira] [Resolved] (NUTCH-2924) Generate maxCount expr evaluated only once

2022-12-12 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2924. -- Resolution: Fixed {color:#00}To https://gitbox.apache.org/repos/asf/nutch.git {color

[jira] [Comment Edited] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644838#comment-17644838 ] Markus Jelsma edited comment on NUTCH-2978 at 12/8/22 2:3

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644838#comment-17644838 ] Markus Jelsma commented on NUTCH-2978: -- Ah, well. I also tried a Tika par

[jira] [Comment Edited] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644825#comment-17644825 ] Markus Jelsma edited comment on NUTCH-2978 at 12/8/22 2:1

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644825#comment-17644825 ] Markus Jelsma commented on NUTCH-2978: -- This morning i saw one of our inte

[jira] [Commented] (NUTCH-2924) Generate maxCount expr evaluated only once

2022-12-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644808#comment-17644808 ] Markus Jelsma commented on NUTCH-2924: -- Here's the proper patch

[jira] [Updated] (NUTCH-2924) Generate maxCount expr evaluated only once

2022-12-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2924: - Attachment: NUTCH-2924-5.patch > Generate maxCount expr evaluated only o

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-07 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644491#comment-17644491 ] Markus Jelsma commented on NUTCH-2978: -- Yes, i saw the slf4j present in the pl

[jira] [Commented] (NUTCH-2924) Generate maxCount expr evaluated only once

2022-12-07 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644489#comment-17644489 ] Markus Jelsma commented on NUTCH-2924: -- Yes, that is expected. This patch requ

[jira] [Resolved] (NUTCH-2977) Support for showing dependency tree

2022-12-07 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2977. -- Fix Version/s: 1.20 Resolution: Fixed > Support for showing dependency t

[jira] [Commented] (NUTCH-2977) Support for showing dependency tree

2022-12-07 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644436#comment-17644436 ] Markus Jelsma commented on NUTCH-2977: -- {color:#00}Committed:{c

[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-07 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2978: - Description: I got in trouble upgrading some dependencies and got a lot of LinkageErrors today

[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-07 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2978: - Description: I got in trouble upgrading some dependencies and got a lot of LinkageErrors today

[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-07 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2978: - Attachment: NUTCH-2978.patch > Move to slf4j2 and remove log4j1 and reloa

[jira] [Created] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-07 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-2978: Summary: Move to slf4j2 and remove log4j1 and reload4j Key: NUTCH-2978 URL: https://issues.apache.org/jira/browse/NUTCH-2978 Project: Nutch Issue Type: Task
