[jira] [Commented] (NUTCH-2924) Generate maxCount expr evaluated only once

2022-12-07 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644298#comment-17644298 ] Markus Jelsma commented on NUTCH-2924: -- Updated patch for master. > Generate maxCount e

[jira] [Updated] (NUTCH-2924) Generate maxCount expr evaluated only once

2022-12-07 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2924: - Attachment: NUTCH-2924-4.patch > Generate maxCount expr evaluated only o

[jira] [Commented] (NUTCH-2973) Single domain names (eg https://localnet) can't be crawled - filtering fails

2022-10-20 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621023#comment-17621023 ] Markus Jelsma commented on NUTCH-2973: -- Hello David, By default urlfilter-validator is an active

Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-24 Thread Markus Jelsma
Hi, Everything seems fine, the crawler seems fine when trying the binary distribution. The source won't work because this computer still cannot compile it. Clearing the local Ivy cache did not do much. This is the known compiler error with the elastic-indexer plugin: compile: [echo] Compiling

[jira] [Commented] (NUTCH-2969) Javadoc: Javascript search is not working when built on JDK 11

2022-08-22 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582978#comment-17582978 ] Markus Jelsma commented on NUTCH-2969: -- Nice! > Javadoc: Javascript search is not working w

[jira] [Commented] (NUTCH-2960) indexer-elastic: remove plugin from binary package to address licensing issues

2022-08-22 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582859#comment-17582859 ] Markus Jelsma commented on NUTCH-2960: -- Yes, this would be much preferred over removing the binaries

[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1

2022-08-09 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577447#comment-17577447 ] Markus Jelsma commented on NUTCH-2959: -- Nice, thanks to NUTCH-2669 i can pass the issue by using

[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1

2022-08-09 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577420#comment-17577420 ] Markus Jelsma commented on NUTCH-2959: -- Here's a patch. This patch does not include the change

[jira] [Updated] (NUTCH-2959) Upgrade to Apache Tika 2.4.1

2022-08-09 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2959: - Attachment: NUTCH-2959.patch > Upgrade to Apache Tika 2.

[jira] [Created] (NUTCH-2959) Upgrade to Apache Tika 2.4.1

2022-08-09 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-2959: Summary: Upgrade to Apache Tika 2.4.1 Key: NUTCH-2959 URL: https://issues.apache.org/jira/browse/NUTCH-2959 Project: Nutch Issue Type: Task

Re: [DISCUSS] Release 1.19 ?

2022-08-09 Thread Markus Jelsma
Sounds good! I see we're still at Tika 2.3.0, i'll submit a patch to upgrade to the current 2.4.1. Thanks! Markus On Tue 9 Aug 2022 at 09:11, Sebastian Nagel wrote: > Hi all, > > more than 60 issues are done for Nutch 1.19 > > https://issues.apache.org/jira/projects/NUTCH/versions/12349580

[jira] [Commented] (NUTCH-2953) Indexer Elastic to ignore SSL issues

2022-08-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576811#comment-17576811 ] Markus Jelsma commented on NUTCH-2953: -- Yes, this is what it should look like +1 > Indexer Elas

[jira] [Commented] (NUTCH-2956) index-geoip: dependency upgrades and improvements

2022-08-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576662#comment-17576662 ] Markus Jelsma commented on NUTCH-2956: -- Looks good! +1 > index-geoip: dependency upgra

[jira] [Commented] (NUTCH-2953) Indexer Elastic to ignore SSL issues

2022-06-21 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556860#comment-17556860 ] Markus Jelsma commented on NUTCH-2953: -- I don't understand, i cannot even compile a clean checkout

[jira] [Commented] (NUTCH-2953) Indexer Elastic to ignore SSL issues

2022-06-21 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556845#comment-17556845 ] Markus Jelsma commented on NUTCH-2953: -- Bah, updating the patch was as easy as expected, but i run

[jira] [Updated] (NUTCH-2953) Indexer Elastic to ignore SSL issues

2022-06-21 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2953: - Attachment: NUTCH-2953-1.patch > Indexer Elastic to ignore SSL iss

[jira] [Commented] (NUTCH-2953) Indexer Elastic to ignore SSL issues

2022-06-21 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556830#comment-17556830 ] Markus Jelsma commented on NUTCH-2953: -- Yes, the current patch is for 1.18 only. Modifying

[jira] [Updated] (NUTCH-2953) Indexer Elastic to ignore SSL issues

2022-06-20 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2953: - Attachment: NUTCH-2953.patch > Indexer Elastic to ignore SSL iss

[jira] [Created] (NUTCH-2953) Indexer Elastic to ignore SSL issues

2022-06-20 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-2953: Summary: Indexer Elastic to ignore SSL issues Key: NUTCH-2953 URL: https://issues.apache.org/jira/browse/NUTCH-2953 Project: Nutch Issue Type: Improvement

[jira] [Commented] (NUTCH-2950) UpdateHostDb: performance improvements

2022-05-20 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540128#comment-17540128 ] Markus Jelsma commented on NUTCH-2950: -- I've seen the patch, there's no need to split it up

[jira] [Commented] (NUTCH-2946) Fetcher: optionally slow down fetching from hosts with repeated exceptions

2022-05-04 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531695#comment-17531695 ] Markus Jelsma commented on NUTCH-2946: -- Nice! +1 > Fetcher: optionally slow down fetching f

[jira] [Commented] (NUTCH-2946) Fetcher: optionally slow down fetching from hosts with repeated exceptions

2022-05-03 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531250#comment-17531250 ] Markus Jelsma commented on NUTCH-2946: -- Doubling the delay per error sounds fine. I am not sure

[jira] [Commented] (NUTCH-2946) Fetcher: optionally slow down fetching from hosts with repeated exceptions

2022-05-03 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531217#comment-17531217 ] Markus Jelsma commented on NUTCH-2946: -- Sounds good! If you'd prefer this to be optional, i would

[jira] [Commented] (NUTCH-2929) Fetcher: start threads slowly to avoid that resources are temporarily exhausted

2022-01-11 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472772#comment-17472772 ] Markus Jelsma commented on NUTCH-2929: -- I haven't seen this problem in our crawler before. The 10ms

[jira] [Commented] (NUTCH-2924) Generate maxCount expr evaluated only once

2021-12-31 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17467279#comment-17467279 ] Markus Jelsma commented on NUTCH-2924: -- Updated patch: failure in an expression, e.g. field does

[jira] [Updated] (NUTCH-2924) Generate maxCount expr evaluated only once

2021-12-31 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2924: - Attachment: NUTCH-2924-3.patch > Generate maxCount expr evaluated only o

[jira] [Commented] (NUTCH-2924) Generate maxCount expr evaluated only once

2021-12-31 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17467272#comment-17467272 ] Markus Jelsma commented on NUTCH-2924: -- Well, the old approach did work if you had a few hosts

[jira] [Updated] (NUTCH-2924) Generate maxCount expr evaluated only once

2021-12-31 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2924: - Attachment: NUTCH-2924-2.patch > Generate maxCount expr evaluated only o

[jira] [Commented] (NUTCH-2924) Generate maxCount expr evaluated only once

2021-12-30 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466942#comment-17466942 ] Markus Jelsma commented on NUTCH-2924: -- Updated patch, logging INFO > DEBUG. Otherwise slow reduc

[jira] [Updated] (NUTCH-2924) Generate maxCount expr evaluated only once

2021-12-30 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2924: - Attachment: NUTCH-2924-1.patch > Generate maxCount expr evaluated only o

[jira] [Updated] (NUTCH-2924) Generate maxCount expr evaluated only once

2021-12-30 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2924: - Attachment: NUTCH-2924.patch > Generate maxCount expr evaluated only o

[jira] [Commented] (NUTCH-2924) Generate maxCount expr evaluated only once

2021-12-30 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466926#comment-17466926 ] Markus Jelsma commented on NUTCH-2924: -- Patch, again, only for 1.15, for now. > Generate maxCo

[jira] [Created] (NUTCH-2924) Generate maxCount expr evaluated only once

2021-12-30 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-2924: Summary: Generate maxCount expr evaluated only once Key: NUTCH-2924 URL: https://issues.apache.org/jira/browse/NUTCH-2924 Project: Nutch Issue Type: Bug

[jira] [Commented] (NUTCH-2917) Remove transitive dependency to log4j 1.x

2021-12-21 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17463207#comment-17463207 ] Markus Jelsma commented on NUTCH-2917: -- I think we can remove it. Hadoop picks up anything that logs

[jira] [Commented] (NUTCH-2921) DepthScoringFilter option to reset max_depth

2021-12-21 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17463200#comment-17463200 ] Markus Jelsma commented on NUTCH-2921: -- Haha, that was stupid! Attached correct patch, i hope

[jira] [Updated] (NUTCH-2921) DepthScoringFilter option to reset max_depth

2021-12-21 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2921: - Attachment: (was: NUTCH-2912.patch) > DepthScoringFilter option to reset max_de

[jira] [Updated] (NUTCH-2921) DepthScoringFilter option to reset max_depth

2021-12-21 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2921: - Attachment: NUTCH-2921.patch > DepthScoringFilter option to reset max_de

[jira] [Commented] (NUTCH-2921) DepthScoringFilter option to reset max_depth

2021-12-20 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462560#comment-17462560 ] Markus Jelsma commented on NUTCH-2921: -- Patched against 1.15. Set scoring.depth.reset.max to a non

[jira] [Updated] (NUTCH-2921) DepthScoringFilter option to reset max_depth

2021-12-20 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2921: - Attachment: NUTCH-2912.patch > DepthScoringFilter option to reset max_de

[jira] [Created] (NUTCH-2921) DepthScoringFilter option to reset max_depth

2021-12-20 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-2921: Summary: DepthScoringFilter option to reset max_depth Key: NUTCH-2921 URL: https://issues.apache.org/jira/browse/NUTCH-2921 Project: Nutch Issue Type

[jira] [Commented] (NUTCH-2912) CrawlDatumProcessor to calculate crawl completeness

2021-12-02 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17452412#comment-17452412 ] Markus Jelsma commented on NUTCH-2912: -- Added new patch, because, as usual, my first is incorrect

[jira] [Updated] (NUTCH-2912) CrawlDatumProcessor to calculate crawl completeness

2021-12-02 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2912: - Attachment: NUTCH-2912-1.patch > CrawlDatumProcessor to calculate crawl completen

[jira] [Updated] (NUTCH-2912) CrawlDatumProcessor to calculate crawl completeness

2021-12-01 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2912: - Attachment: NUTCH-2912.patch > CrawlDatumProcessor to calculate crawl completen

[jira] [Created] (NUTCH-2912) CrawlDatumProcessor to calculate crawl completeness

2021-12-01 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-2912: Summary: CrawlDatumProcessor to calculate crawl completeness Key: NUTCH-2912 URL: https://issues.apache.org/jira/browse/NUTCH-2912 Project: Nutch Issue Type

[jira] [Commented] (NUTCH-2908) Log mapreduce job messages and counters in local mode

2021-11-23 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447947#comment-17447947 ] Markus Jelsma commented on NUTCH-2908: -- This is it? In that case, it is an excellent oneliner

[jira] [Commented] (NUTCH-2867) Support for custom HostDb aggregators

2021-11-22 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447394#comment-17447394 ] Markus Jelsma commented on NUTCH-2867: -- Yes, both points are valid and fine. Please also update

[jira] [Commented] (NUTCH-2865) WARC exporter support for metadata and dropping empty responses

2021-11-22 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447388#comment-17447388 ] Markus Jelsma commented on NUTCH-2865: -- Hello Sebastian, no objections, it looks fine as usual

[jira] [Commented] (NUTCH-2867) Support for custom HostDb aggregators

2021-11-19 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17446563#comment-17446563 ] Markus Jelsma commented on NUTCH-2867: -- Of course, how typical. Here's a new patch. Hee, Jira seems

[jira] [Updated] (NUTCH-2867) Support for custom HostDb aggregators

2021-11-19 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2867: - Attachment: (was: NUTCH-2867-1.patch) > Support for custom HostDb aggregat

[jira] [Updated] (NUTCH-2867) Support for custom HostDb aggregators

2021-11-19 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2867: - Attachment: NUTCH-2867-1.patch > Support for custom HostDb aggregat

[jira] [Updated] (NUTCH-2867) Support for custom HostDb aggregators

2021-11-19 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2867: - Attachment: NUTCH-2867-1.patch > Support for custom HostDb aggregat

[jira] [Commented] (NUTCH-2869) Add @Override annotations to Nutch plugins

2021-06-10 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360897#comment-17360897 ] Markus Jelsma commented on NUTCH-2869: -- Yes! +1 > Add @Override annotations to Nutch plug

[jira] [Updated] (NUTCH-2867) Support for custom HostDb aggregators

2021-06-09 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2867: - Attachment: NUTCH-2867.patch > Support for custom HostDb aggregat

[jira] [Updated] (NUTCH-2867) Support for custom HostDb aggregators

2021-06-09 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2867: - Attachment: (was: NUTCH-2867.patch) > Support for custom HostDb aggregat

[jira] [Commented] (NUTCH-2867) Support for custom HostDb aggregators

2021-06-09 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360020#comment-17360020 ] Markus Jelsma commented on NUTCH-2867: -- Simple patch with example processor, showing how easy

[jira] [Created] (NUTCH-2867) Support for custom HostDb aggregators

2021-06-09 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-2867: Summary: Support for custom HostDb aggregators Key: NUTCH-2867 URL: https://issues.apache.org/jira/browse/NUTCH-2867 Project: Nutch Issue Type: Improvement

[jira] [Updated] (NUTCH-2867) Support for custom HostDb aggregators

2021-06-09 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2867: - Attachment: NUTCH-2867.patch > Support for custom HostDb aggregat

[jira] [Commented] (NUTCH-2855) Update org.elasticsearch.client

2021-06-09 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359914#comment-17359914 ] Markus Jelsma commented on NUTCH-2855: -- Ugh, i am stuck. Cleaning libs, caches, builds, new

[jira] [Comment Edited] (NUTCH-2865) WARC exporter support for metadata and dropping empty responses

2021-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359353#comment-17359353 ] Markus Jelsma edited comment on NUTCH-2865 at 6/8/21, 1:28 PM: --- * separated

[jira] [Commented] (NUTCH-2865) WARC exporter support for metadata and dropping empty responses

2021-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359353#comment-17359353 ] Markus Jelsma commented on NUTCH-2865: -- * separated the parseText and parseData * changed

[jira] [Updated] (NUTCH-2865) WARC exporter support for metadata and dropping empty responses

2021-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2865: - Attachment: NUTCH-2865.patch > WARC exporter support for metadata and dropping empty respon

[jira] [Updated] (NUTCH-2865) WARC exporter support for metadata and dropping empty responses

2021-06-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2865: - Description: WARCExporter is a handy tool to dump the segments. Unfortunately it also emits

[jira] [Commented] (NUTCH-2866) MetaData.toString() should return "key=value ..."

2021-06-01 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355009#comment-17355009 ] Markus Jelsma commented on NUTCH-2866: -- Great patch! +1 > MetaData.toString() should return "

[jira] [Commented] (NUTCH-2865) WARC exporter support for metadata and dropping empty responses

2021-05-26 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351745#comment-17351745 ] Markus Jelsma commented on NUTCH-2865: -- Any comments on this? > WARC exporter support for metad

[jira] [Updated] (NUTCH-2865) WARC exporter support for metadata and dropping empty responses

2021-05-24 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2865: - Description: WARCExporter is a handy tool to dump the segments. Unfortunately it also emits

[jira] [Updated] (NUTCH-2865) WARC exporter support for metadata and dropping empty responses

2021-05-20 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2865: - Attachment: (was: NUTCH-2865.patch) > WARC exporter support for metadata and dropping em

[jira] [Updated] (NUTCH-2865) WARC exporter support for metadata and dropping empty responses

2021-05-20 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2865: - Attachment: NUTCH-2865.patch > WARC exporter support for metadata and dropping empty respon

[jira] [Updated] (NUTCH-2865) WARC exporter support for metadata and dropping empty responses

2021-05-20 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2865: - Attachment: NUTCH-2865.patch > WARC exporter support for metadata and dropping empty respon

[jira] [Commented] (NUTCH-2865) WARC exporter support for metadata and dropping empty responses

2021-05-20 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17348203#comment-17348203 ] Markus Jelsma commented on NUTCH-2865: -- Updated patch, there was still a println() somewhere

[jira] [Updated] (NUTCH-2865) WARC exporter support for metadata and dropping empty responses

2021-05-20 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2865: - Attachment: NUTCH-2865.patch > WARC exporter support for metadata and dropping empty respon

[jira] [Created] (NUTCH-2865) WARC exporter support for metadata and dropping empty responses

2021-05-20 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-2865: Summary: WARC exporter support for metadata and dropping empty responses Key: NUTCH-2865 URL: https://issues.apache.org/jira/browse/NUTCH-2865 Project: Nutch

[jira] [Commented] (NUTCH-2855) Update org.elasticsearch.client

2021-05-20 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17348189#comment-17348189 ] Markus Jelsma commented on NUTCH-2855: -- I am using OpenJDK 11.0.11 on Kubuntu 20.04 LTS, and am

[jira] [Commented] (NUTCH-2855) Update org.elasticsearch.client

2021-05-20 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17348168#comment-17348168 ] Markus Jelsma commented on NUTCH-2855: -- Something is wrong, i removed ~/.ant and ~/.ivy2 and ant

[jira] [Commented] (NUTCH-2855) Update org.elasticsearch.client

2021-05-19 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347700#comment-17347700 ] Markus Jelsma commented on NUTCH-2855: -- I cleaned my build and made the checkout up to date, but i

[jira] [Commented] (NUTCH-2863) Injector to parse command-line flags case-insensitive

2021-05-07 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340782#comment-17340782 ] Markus Jelsma commented on NUTCH-2863: -- Good point! Ignoring case would keep compatibility and would

[jira] [Commented] (NUTCH-2859) urlnormalizer-protocol: allow to normalize domains

2021-03-29 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17310697#comment-17310697 ] Markus Jelsma commented on NUTCH-2859: -- Looks great! +1 Thanks! > urlnormalizer-protocol: al

RE: [DISCUSS] Replacing MapReduce with Tez

2020-12-21 Thread Markus Jelsma
Hello Lewis, 1. counters, for me they are a requirement to have as they are key to regular inspections of ongoing crawls, finding errors and debugging. I hope you can find a work around. 2. sounds interesting, but i'd like to see the test run with 12M rather than 12k URLs. A question, are

[jira] [Commented] (NUTCH-2835) Upgrade commons-jexl from 2 --> 3

2020-12-18 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251709#comment-17251709 ] Markus Jelsma commented on NUTCH-2835: -- Thanks! > Upgrade commons-jexl from 2 --

[jira] [Commented] (NUTCH-2833) Upgrade to Tika 1.25

2020-12-02 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242307#comment-17242307 ] Markus Jelsma commented on NUTCH-2833: -- +1 > Upgrade to Tika 1

[jira] [Commented] (NUTCH-2764) Weird build error javax.javax.measure#unit-api

2020-11-15 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232315#comment-17232315 ] Markus Jelsma commented on NUTCH-2764: -- Sounds great, especially the hint of ~/.ant/lib/. Those

[jira] [Closed] (NUTCH-2764) Weird build error javax.javax.measure#unit-api

2020-11-15 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2764. Resolution: Information Provided > Weird build error javax.javax.measure#unit-

[jira] [Commented] (NUTCH-2764) Weird build error javax.javax.measure#unit-api

2020-11-10 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17229133#comment-17229133 ] Markus Jelsma commented on NUTCH-2764: -- Ah, this dreaded error! I am sorry but i don't know

RE: [ANNOUNCE] Apache Nutch 1.17 Release

2020-07-02 Thread Markus Jelsma
Thanks Sebastian! -Original message- > From:Sebastian Nagel > Sent: Thursday 2nd July 2020 16:42 > To: u...@nutch.apache.org > Cc: dev@nutch.apache.org; annou...@apache.org > Subject: [ANNOUNCE] Apache Nutch 1.17 Release > > The Apache Nutch team is pleased to announce the release

RE: [VOTE] Release Apache Nutch 1.17 RC#1

2020-06-30 Thread Markus Jelsma
Hello, +1 from me too! Thanks, Markus -Original message- > From:Furkan KAMACI > Sent: Saturday 20th June 2020 18:15 > To: dev@nutch.apache.org > Subject: Re: [VOTE] Release Apache Nutch 1.17 RC#1 > > Hi, > > +1 from me (binding). > > I checked: > > - LICENSE and NOTICE are fine  >

[jira] [Commented] (NUTCH-2794) Add additional ciphers to HTTP base's default cipher suite

2020-06-17 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138560#comment-17138560 ] Markus Jelsma commented on NUTCH-2794: -- Will you then add the entry to the changelog?   >

[jira] [Resolved] (NUTCH-2794) Add additional ciphers to HTTP base's default cipher suite

2020-06-17 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2794. -- Resolution: Fixed To https://gitbox.apache.org/repos/asf/nutch.git 59d0d9532..1c2e4110c

[jira] [Updated] (NUTCH-2794) Add additional ciphers to HTTP base's default cipher suite

2020-06-16 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2794: - Attachment: NUTCH-2794.patch > Add additional ciphers to HTTP base's default cipher su

[jira] [Updated] (NUTCH-2794) Add additional ciphers to HTTP base's default cipher suite

2020-06-16 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2794: - Attachment: NUTCH-2794.patch > Add additional ciphers to HTTP base's default cipher su

[jira] [Created] (NUTCH-2794) Add additional ciphers to HTTP base's default cipher suite

2020-06-16 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-2794: Summary: Add additional ciphers to HTTP base's default cipher suite Key: NUTCH-2794 URL: https://issues.apache.org/jira/browse/NUTCH-2794 Project: Nutch

RE: [PROPOSAL] Replace whitelist blacklist with allowlist denylist

2020-06-10 Thread Markus Jelsma
Hello Lewis, I understand the proposal. As an engineer, however, i have some points i would like to address: * The proposed change is not backward compatible, which weighs heavy because it is also not a technical necessity. * Our users, myself included, have to make a small or, depending on

[jira] [Commented] (NUTCH-2419) Some URL filters and normalizers do not respect command-line override for rule file

2020-05-13 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17106273#comment-17106273 ] Markus Jelsma commented on NUTCH-2419: -- Cool! +1 > Some URL filters and normalizers do not resp

[jira] [Commented] (NUTCH-2419) Some URL filters and normalizers do not respect command-line override for rule file

2020-05-13 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17106184#comment-17106184 ] Markus Jelsma commented on NUTCH-2419: -- Hello Sebastian! I glanced over the diff and it seems

[jira] [Commented] (NUTCH-2434) Add methods to reset parameters HTMLMetaTags

2020-04-30 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096645#comment-17096645 ] Markus Jelsma commented on NUTCH-2434: -- Ah, thanks! > Add methods to reset parameters HTMLMetaT

[jira] [Commented] (NUTCH-2775) Fetcher to guarantee minimum delay even if robots.txt defines shorter Crawl-delay

2020-02-29 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048368#comment-17048368 ] Markus Jelsma commented on NUTCH-2775: -- {quote}What about ignoring Crawl-Delay values shorter than

[jira] [Commented] (NUTCH-2769) Nutch 1.15 unable to parse certain outlinks

2020-02-26 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045401#comment-17045401 ] Markus Jelsma commented on NUTCH-2769: -- The parse-html plugin indeed does not output some of those

[jira] [Commented] (NUTCH-2767) Fetcher to stop filling queues skipped due to repeated exceptions

2020-02-19 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040237#comment-17040237 ] Markus Jelsma commented on NUTCH-2767: -- I had to check whether Nutch was already on Java 8 when i

RE: Fosdem

2020-02-07 Thread Markus Jelsma
Hi, Quite some time ago someone announced they had launched a web store selling Nutch branded stuff. It still seems to be around: https://www.cafepress.com/nutch How about that? Markus -Original message- > From:Sebastian Nagel > Sent: Friday 7th February 2020 12:23 > To:

[jira] [Commented] (NUTCH-2761) ivy jar fails to download

2020-01-21 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020189#comment-17020189 ] Markus Jelsma commented on NUTCH-2761: -- You are right, it is not related to this problem. I assumed

[jira] [Created] (NUTCH-2764) Weird build error javax.javax.measure#unit-api

2020-01-21 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-2764: Summary: Weird build error javax.javax.measure#unit-api Key: NUTCH-2764 URL: https://issues.apache.org/jira/browse/NUTCH-2764 Project: Nutch Issue Type: Bug

[jira] [Commented] (NUTCH-2761) ivy jar fails to download

2020-01-20 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019653#comment-17019653 ] Markus Jelsma commented on NUTCH-2761: -- I believe the next problem is related to this issue, but i
