[jira] [Commented] (NUTCH-2761) ivy jar fails to download

2020-01-17 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018046#comment-17018046 ] Markus Jelsma commented on NUTCH-2761: -- thanks! > ivy jar fails to downl

[jira] [Commented] (NUTCH-2733) protocol-okhttp: add support for Brotli compression (Content-Encoding)

2020-01-17 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018035#comment-17018035 ] Markus Jelsma commented on NUTCH-2733: -- Sounds good! +1 > protocol-okhttp: add support for Bro

[jira] [Commented] (NUTCH-2748) Fetch status gone (redirect exceeded) not to overwrite existing items in CrawlDb

2019-11-08 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970109#comment-16970109 ] Markus Jelsma commented on NUTCH-2748: -- Nice catch! I think i would prefer the first option

RE: [VOTE] Release Apache Nutch 1.16 RC#1

2019-10-03 Thread Markus Jelsma
Hello Sebastian, All tests pass nicely and i can easily run a crawl. +1 Thanks, Markus By the way, what does this mean: 2019-10-03 12:48:49,696 INFO  crawl.Generator - Generator: number of items rejected during selection: 2019-10-03 12:48:49,698 INFO  crawl.Generator - Generator:  1 

[jira] [Commented] (NUTCH-2612) Support for sitemap processing by hostname

2019-09-09 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925665#comment-16925665 ] Markus Jelsma commented on NUTCH-2612: -- The error is all mine! The corrected version is committed

[jira] [Resolved] (NUTCH-2612) Support for sitemap processing by hostname

2019-09-09 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2612. -- Resolution: Fixed > Support for sitemap processing by hostn

[jira] [Commented] (NUTCH-2612) Support for sitemap processing by hostname

2019-09-09 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925644#comment-16925644 ] Markus Jelsma commented on NUTCH-2612: -- Hello [~wastl-nagel], are you sure? I just cleaned my

[jira] [Commented] (NUTCH-2612) Support for sitemap processing by hostname

2019-09-06 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924187#comment-16924187 ] Markus Jelsma commented on NUTCH-2612: -- Any thoughts left? I'd like to get this one in. > Supp

[jira] [Commented] (NUTCH-2669) Reliable solution for javax.ws packaging.type

2019-08-30 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919618#comment-16919618 ] Markus Jelsma commented on NUTCH-2669: -- Great! Thanks Sebastian! > Reliable solution for javax

[jira] [Commented] (NUTCH-2730) SitemapProcessor to treat sitemap URLs as Set instead of List

2019-08-20 Thread Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911699#comment-16911699 ] Markus Jelsma commented on NUTCH-2730: -- Hello [~wastl-nagel]! Yes Crawler-Commons is definitely

[jira] [Updated] (NUTCH-2730) SitemapProcessor to treat sitemap URLs as Set instead of List

2019-08-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2730: - Attachment: NUTCH-2730.patch > SitemapProcessor to treat sitemap URLs as Set instead of L

[jira] [Created] (NUTCH-2730) SitemapProcessor to treat sitemap URLs as Set instead of List

2019-08-16 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2730: Summary: SitemapProcessor to treat sitemap URLs as Set instead of List Key: NUTCH-2730 URL: https://issues.apache.org/jira/browse/NUTCH-2730 Project: Nutch

[jira] [Commented] (NUTCH-2727) Upgrade Hadoop dependencies to 2.9.2

2019-08-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900978#comment-16900978 ] Markus Jelsma commented on NUTCH-2727: -- Good point! Indeed, we just run Nutch on 3.2.0. We do

[jira] [Comment Edited] (NUTCH-2727) Upgrade Hadoop dependencies to 2.9.2

2019-08-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900978#comment-16900978 ] Markus Jelsma edited comment on NUTCH-2727 at 8/6/19 12:29 PM: --- Good point

[jira] [Comment Edited] (NUTCH-2727) Upgrade Hadoop dependencies to 2.9.2

2019-08-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900908#comment-16900908 ] Markus Jelsma edited comment on NUTCH-2727 at 8/6/19 11:19 AM: --- Hello

[jira] [Commented] (NUTCH-2727) Upgrade Hadoop dependencies to 2.9.2

2019-08-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900908#comment-16900908 ] Markus Jelsma commented on NUTCH-2727: -- Hello @snagel We have been running it, and other programs

[jira] [Closed] (NUTCH-2725) Plugin lib-http to support per-host configurable cookies

2019-07-29 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2725. > Plugin lib-http to support per-host configurable cook

[jira] [Resolved] (NUTCH-2725) Plugin lib-http to support per-host configurable cookies

2019-07-29 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2725. -- Resolution: Fixed > Plugin lib-http to support per-host configurable cook

[jira] [Commented] (NUTCH-2725) Plugin lib-http to support per-host configurable cookies

2019-07-29 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895139#comment-16895139 ] Markus Jelsma commented on NUTCH-2725: -- Committed a67c9bee..54f73bf7 master -> master Tha

[jira] [Updated] (NUTCH-2725) Plugin lib-http to support per-host configurable cookies

2019-07-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2725: - Attachment: NUTCH-2725.patch > Plugin lib-http to support per-host configurable cook

[jira] [Commented] (NUTCH-2725) Plugin lib-http to support per-host configurable cookies

2019-07-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892873#comment-16892873 ] Markus Jelsma commented on NUTCH-2725: -- Addressed all three points. Thanks Sebastian! > Plugin

[jira] [Updated] (NUTCH-2725) Plugin lib-http to support per-host configurable cookies

2019-07-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2725: - Attachment: NUTCH-2725.patch > Plugin lib-http to support per-host configurable cook

[jira] [Updated] (NUTCH-2725) Plugin lib-http to support per-host configurable cookies

2019-07-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2725: - Patch Info: Patch Available > Plugin lib-http to support per-host configurable cook

[jira] [Created] (NUTCH-2725) Plugin lib-http to support per-host configurable cookies

2019-07-25 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2725: Summary: Plugin lib-http to support per-host configurable cookies Key: NUTCH-2725 URL: https://issues.apache.org/jira/browse/NUTCH-2725 Project: Nutch Issue

[jira] [Closed] (NUTCH-2724) Metadata indexer not to emit empty values

2019-07-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2724. > Metadata indexer not to emit empty val

[jira] [Resolved] (NUTCH-2724) Metadata indexer not to emit empty values

2019-07-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2724. -- Resolution: Fixed Committed 96924648..a67c9bee master -> master Thanks! > Metadata i

[jira] [Updated] (NUTCH-2724) Metadata indexer not to emit empty values

2019-07-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2724: - Attachment: NUTCH-2724.patch > Metadata indexer not to emit empty val

[jira] [Commented] (NUTCH-2724) Metadata indexer not to emit empty values

2019-07-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883698#comment-16883698 ] Markus Jelsma commented on NUTCH-2724: -- Of course, thanks, i should use isEmpty() more often, length

[jira] [Closed] (NUTCH-2723) Indexer Solr not to decode URLs before deletion

2019-07-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2723. > Indexer Solr not to decode URLs before delet

[jira] [Resolved] (NUTCH-2723) Indexer Solr not to decode URLs before deletion

2019-07-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2723. -- Resolution: Fixed Thanks Sebastian! Committed fc6a2742..96924648 master -> master > I

[jira] [Commented] (NUTCH-2722) Fetch dependencies via https

2019-07-09 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881287#comment-16881287 ] Markus Jelsma commented on NUTCH-2722: -- I removed my Ivy cache, patched it and it went all fine. I

[jira] [Updated] (NUTCH-2724) Metadata indexer not to emit empty values

2019-06-24 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2724: - Attachment: NUTCH-2724.patch > Metadata indexer not to emit empty val

[jira] [Created] (NUTCH-2724) Metadata indexer not to emit empty values

2019-06-24 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2724: Summary: Metadata indexer not to emit empty values Key: NUTCH-2724 URL: https://issues.apache.org/jira/browse/NUTCH-2724 Project: Nutch Issue Type: Bug

[jira] [Updated] (NUTCH-2723) Indexer Solr not to decode URLs before deletion

2019-06-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2723: - Attachment: NUTCH-2723.patch > Indexer Solr not to decode URLs before delet

[jira] [Updated] (NUTCH-2710) Normalize before internal and external checks

2019-06-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2710: - Attachment: NUTCH-2710.patch > Normalize before internal and external che

[jira] [Created] (NUTCH-2723) Indexer Solr not to decode URLs before deletion

2019-06-19 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2723: Summary: Indexer Solr not to decode URLs before deletion Key: NUTCH-2723 URL: https://issues.apache.org/jira/browse/NUTCH-2723 Project: Nutch Issue Type

[jira] [Commented] (NUTCH-2585) NPE in TrieStringMatcher

2019-05-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833675#comment-16833675 ] Markus Jelsma commented on NUTCH-2585: -- That seems fine enough! +1 > NPE in TrieStringMatc

[jira] [Resolved] (NUTCH-1625) IndexerMapReduce skips FETCH_NOTMODIFIED

2019-05-01 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-1625. -- Resolution: Won't Fix > IndexerMapReduce skips FETCH_NOTMODIF

[jira] [Commented] (NUTCH-1625) IndexerMapReduce skips FETCH_NOTMODIFIED

2019-05-01 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830966#comment-16830966 ] Markus Jelsma commented on NUTCH-1625: -- Closing this issue. For some reason, this patch doesn't

[jira] [Updated] (NUTCH-2585) NPE in TrieStringMatcher

2019-05-01 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2585: - Description: Stumbled on this one just now: {code} 2018-05-25 14:29:31,844 INFO [FetcherThread

[jira] [Updated] (NUTCH-2585) NPE in TrieStringMatcher

2019-05-01 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2585: - Description: Stumbled on this one just now: {code} 2018-05-25 14:29:31,844 INFO [FetcherThread

[jira] [Deleted] (NUTCH-2714) WHAT KINDS OF FUNCTION CAN FREELANCE ENGINEERS DO?

2019-04-30 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma deleted NUTCH-2714: - > WHAT KINDS OF FUNCTION CAN FREELANCE ENGINEERS

[jira] [Closed] (NUTCH-2714) WHAT KINDS OF FUNCTION CAN FREELANCE ENGINEERS DO?

2019-04-30 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2714. > WHAT KINDS OF FUNCTION CAN FREELANCE ENGINEERS

[jira] [Deleted] (NUTCH-2712) The digital landscape is ever changing. It exists in a constant state of evolution and revolution, driven by new innovations

2019-04-30 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma deleted NUTCH-2712: - > The digital landscape is ever changing. It exists in a constant state of > evo

[jira] [Closed] (NUTCH-2712) The digital landscape is ever changing. It exists in a constant state of evolution and revolution, driven by new innovations

2019-04-30 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2712. Resolution: Invalid > The digital landscape is ever changing. It exists in a constant st

[jira] [Deleted] (NUTCH-2711) The digital landscape is ever changing. It exists in a constant state of evolution and revolution, driven by new innovations

2019-04-30 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma deleted NUTCH-2711: - > The digital landscape is ever changing. It exists in a constant state of > evo

[jira] [Closed] (NUTCH-2711) The digital landscape is ever changing. It exists in a constant state of evolution and revolution, driven by new innovations

2019-04-30 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2711. Resolution: Invalid Closing spam! > The digital landscape is ever changing. It exi

[jira] [Updated] (NUTCH-2711) The digital landscape is ever changing. It exists in a constant state of evolution and revolution, driven by new innovations

2019-04-30 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2711: - Description: Field Engineer offers an online field technician marketplace that helps companies

[jira] [Created] (NUTCH-2710) Normalize before internal and external checks

2019-04-25 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2710: Summary: Normalize before internal and external checks Key: NUTCH-2710 URL: https://issues.apache.org/jira/browse/NUTCH-2710 Project: Nutch Issue Type

[jira] [Commented] (NUTCH-2704) Upgrade crawler-commons dependency to 1.0

2019-04-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16816191#comment-16816191 ] Markus Jelsma commented on NUTCH-2704: -- +1 > Upgrade crawler-commons dependency to

[jira] [Commented] (NUTCH-2703) parse-tika: Boilerpipe should not run for non-(X)HTML pages

2019-04-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815305#comment-16815305 ] Markus Jelsma commented on NUTCH-2703: -- remote: To git@github:apache/nutch.git remote:bf75e96

[jira] [Resolved] (NUTCH-2703) parse-tika: Boilerpipe should not run for non-(X)HTML pages

2019-04-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2703. -- Resolution: Fixed Assignee: Markus Jelsma > parse-tika: Boilerpipe should not

[jira] [Commented] (NUTCH-2703) parse-tika: Boilerpipe should not run for non-(X)HTML pages

2019-04-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815302#comment-16815302 ] Markus Jelsma commented on NUTCH-2703: -- Thanks for not missing both MIME types, text/html

[jira] [Updated] (NUTCH-2703) parse-tika: Boilerpipe should not run for non-(X)HTML pages

2019-04-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2703: - Priority: Minor (was: Critical) > parse-tika: Boilerpipe should not run for non-(X)HTML pa

[jira] [Commented] (NUTCH-2703) parse-tika: Boilerpipe should not run for non-(X)HTML pages

2019-03-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799038#comment-16799038 ] Markus Jelsma commented on NUTCH-2703: -- This patch applies to the current Github master source

[jira] [Commented] (NUTCH-2701) Fetcher: log dates and times also in human-readable form

2019-03-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799037#comment-16799037 ] Markus Jelsma commented on NUTCH-2701: -- +1 > Fetcher: log dates and times also in human-reada

[jira] [Commented] (NUTCH-2703) parse-tika: Boilerpipe should not run for non-(X)HTML pages

2019-03-20 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16797051#comment-16797051 ] Markus Jelsma commented on NUTCH-2703: -- patch for master > parse-tika: Boilerpipe should not

[jira] [Updated] (NUTCH-2703) parse-tika: Boilerpipe should not run for non-(X)HTML pages

2019-03-20 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2703: - Attachment: NUTCH-2703.patch > parse-tika: Boilerpipe should not run for non-(X)HTML pa

[jira] [Commented] (NUTCH-2692) Subcollection to support case-insensitive white and black lists

2019-02-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775316#comment-16775316 ] Markus Jelsma commented on NUTCH-2692: -- Aargh it did! I had a fight with Git, this happened. I'll

[jira] [Resolved] (NUTCH-2692) Subcollection to support case-insensitive white and black lists

2019-02-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2692. -- Resolution: Fixed 78af89f2..0085ee74 master -> master Thanks Sebastian! > Subcoll

[jira] [Updated] (NUTCH-2692) Subcollection to support case-insensitive white and black lists

2019-02-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2692: - Attachment: NUTCH-2692.patch > Subcollection to support case-insensitive white and black li

[jira] [Commented] (NUTCH-2692) Subcollection to support case-insensitive white and black lists

2019-02-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775278#comment-16775278 ] Markus Jelsma commented on NUTCH-2692: -- That was missing indeed! Thanks Sebastian! > Subcollect

[jira] [Commented] (NUTCH-2694) HostDB to aggregate by long instead of integer

2019-02-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775115#comment-16775115 ] Markus Jelsma commented on NUTCH-2694: -- Hmm, yes. Why didn't i notice that? Anyway, updated patch

[jira] [Commented] (NUTCH-2694) HostDB to aggregate by long instead of integer

2019-02-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775119#comment-16775119 ] Markus Jelsma commented on NUTCH-2694: -- Committed da8f3f52..33922feb master -> master Tha

[jira] [Commented] (NUTCH-2692) Subcollection to support case-insensitive white and black lists

2019-02-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774986#comment-16774986 ] Markus Jelsma commented on NUTCH-2692: -- I will commit this one shortly unless objections

[jira] [Commented] (NUTCH-2694) HostDB to aggregate by long instead of integer

2019-02-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774982#comment-16774982 ] Markus Jelsma commented on NUTCH-2694: -- I see, i never made a patch, or i lost it. Anyway, attached

[jira] [Updated] (NUTCH-2694) HostDB to aggregate by long instead of integer

2019-02-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2694: - Attachment: NUTCH-2694.patch > HostDB to aggregate by long instead of inte

[jira] [Created] (NUTCH-2694) HostDB to aggregate by long instead of integer

2019-02-11 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2694: Summary: HostDB to aggregate by long instead of integer Key: NUTCH-2694 URL: https://issues.apache.org/jira/browse/NUTCH-2694 Project: Nutch Issue Type: Bug

[jira] [Updated] (NUTCH-2692) Subcollection to support case-insensitive white and black lists

2019-01-28 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2692: - Attachment: NUTCH-2692.patch > Subcollection to support case-insensitive white and black li

[jira] [Created] (NUTCH-2692) Subcollection to support case-insensitive white and black lists

2019-01-28 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2692: Summary: Subcollection to support case-insensitive white and black lists Key: NUTCH-2692 URL: https://issues.apache.org/jira/browse/NUTCH-2692 Project: Nutch

[jira] [Commented] (NUTCH-2689) Speed up urlfilter-regex and urlfilter-automaton

2019-01-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748783#comment-16748783 ] Markus Jelsma commented on NUTCH-2689: -- Nice catch! It is always nice to see low hanging fruit like

[jira] [Commented] (NUTCH-2678) Allow for per-host configurable protocol plugin

2019-01-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746241#comment-16746241 ] Markus Jelsma commented on NUTCH-2678: -- Great! remote: To git@github:apache/nutch.git remote

[jira] [Resolved] (NUTCH-2678) Allow for per-host configurable protocol plugin

2019-01-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2678. -- Resolution: Fixed > Allow for per-host configurable protocol plu

[jira] [Commented] (NUTCH-2687) Regex for reading title from Content-Disposition is wrong

2019-01-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746144#comment-16746144 ] Markus Jelsma commented on NUTCH-2687: -- Thanks! > Regex for reading title from Content-Disposit

[jira] [Commented] (NUTCH-2678) Allow for per-host configurable protocol plugin

2019-01-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746143#comment-16746143 ] Markus Jelsma commented on NUTCH-2678: -- Alright, so https://patch-diff.githubusercontent.com/raw

[jira] [Created] (NUTCH-2687) Regex for reading title from Content-Disposition is wrong

2019-01-16 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2687: Summary: Regex for reading title from Content-Disposition is wrong Key: NUTCH-2687 URL: https://issues.apache.org/jira/browse/NUTCH-2687 Project: Nutch

[jira] [Updated] (NUTCH-2687) Regex for reading title from Content-Disposition is wrong

2019-01-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2687: - Attachment: NUTCH-2687.patch > Regex for reading title from Content-Disposition is wr

[jira] [Updated] (NUTCH-2687) Regex for reading title from Content-Disposition is wrong

2019-01-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2687: - Description: Given URL: https://www.amuse-project.org/file/download/default

[jira] [Commented] (NUTCH-2678) Allow for per-host configurable protocol plugin

2019-01-08 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737087#comment-16737087 ] Markus Jelsma commented on NUTCH-2678: -- Alright! I added support for protocol:http.., i couldn't

[jira] [Updated] (NUTCH-2678) Allow for per-host configurable protocol plugin

2019-01-08 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2678: - Attachment: NUTCH-2647.patch > Allow for per-host configurable protocol plu

[jira] [Updated] (NUTCH-2678) Allow for per-host configurable protocol plugin

2019-01-08 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2678: - Attachment: (was: NUTCH-2647.patch) > Allow for per-host configurable protocol plu

[jira] [Commented] (NUTCH-2673) EOFException protocol-http

2019-01-07 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735692#comment-16735692 ] Markus Jelsma commented on NUTCH-2673: -- Yes, thanks Sebastian! > EOFException protocol-h

[jira] [Closed] (NUTCH-2673) EOFException protocol-http

2019-01-07 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2673. Resolution: Not A Problem > EOFException protocol-h

[jira] [Commented] (NUTCH-2665) Upgrade to Apache Tika 1.19.1

2018-12-28 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730203#comment-16730203 ] Markus Jelsma commented on NUTCH-2665: -- Thanks! > Upgrade to Apache Tika 1.1

[jira] [Closed] (NUTCH-2665) Upgrade to Apache Tika 1.19.1

2018-12-28 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2665. > Upgrade to Apache Tika 1.19.1 > - > > Key

[jira] [Updated] (NUTCH-2678) Allow for per-host configurable protocol plugin

2018-12-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2678: - Description: Introduces new configuration file for mapping protocol plugins to hostnames. {code

[jira] [Commented] (NUTCH-2678) Allow for per-host configurable protocol plugin

2018-12-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716854#comment-16716854 ] Markus Jelsma commented on NUTCH-2678: -- Hello Sebastian! * it is indeed. I was in a hurry

[jira] [Commented] (NUTCH-2678) Allow for per-host configurable protocol plugin

2018-12-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716865#comment-16716865 ] Markus Jelsma commented on NUTCH-2678: -- Updated patch to include configuration file template. I

[jira] [Updated] (NUTCH-2678) Allow for per-host configurable protocol plugin

2018-12-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2678: - Attachment: NUTCH-2678.patch > Allow for per-host configurable protocol plu

[jira] [Updated] (NUTCH-2678) Allow for per-host configurable protocol plugin

2018-12-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2678: - Attachment: NUTCH-2678.patch > Allow for per-host configurable protocol plu

[jira] [Updated] (NUTCH-2678) Allow for per-host configurable protocol plugin

2018-12-10 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2678: - Attachment: NUTCH-2678.patch > Allow for per-host configurable protocol plu

[jira] [Created] (NUTCH-2678) Allow for per-host configurable protocol plugin

2018-12-10 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2678: Summary: Allow for per-host configurable protocol plugin Key: NUTCH-2678 URL: https://issues.apache.org/jira/browse/NUTCH-2678 Project: Nutch Issue Type

Force specific protocol plugin for set of hosts

2018-12-09 Thread Markus Jelsma
Hello, We need a configurable set of hosts to use a specific protocol plugin. There are several hacks i can think of to achieve this. I am asking here to see if any of you have a good suggestion. Thanks, Markus

RE: Maven vs Gradle for Nutch Build System

2018-11-29 Thread Markus Jelsma
Hello Lewis! I would applaud having a Mavenized build for Nutch! If i remember right, there was a ticket for this, was there not? I do not seem to be able to find it right away. Regards, Markus -Original message- > From:lewis john mcgibbney > Sent: Thursday 29th November 2018

[jira] [Commented] (NUTCH-2675) Give parsers the capability to read and write CrawlDatum

2018-11-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688210#comment-16688210 ] Markus Jelsma commented on NUTCH-2675: -- Well, what we could do is override ParseUtil.parse and add

[jira] [Created] (NUTCH-2673) EOFException protocol-http

2018-11-07 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2673: Summary: EOFException protocol-http Key: NUTCH-2673 URL: https://issues.apache.org/jira/browse/NUTCH-2673 Project: Nutch Issue Type: Bug Affects

[jira] [Commented] (NUTCH-2665) Upgrade to Apache Tika 1.19.1

2018-10-24 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662234#comment-16662234 ] Markus Jelsma commented on NUTCH-2665: -- On my machine it really fails with the latest patch, weird

[jira] [Commented] (NUTCH-2665) Upgrade to Apache Tika 1.19.1

2018-10-24 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661983#comment-16661983 ] Markus Jelsma commented on NUTCH-2665: -- Hello [~axr], yes it compiles fine, that is where

[jira] [Commented] (NUTCH-2665) Upgrade to Apache Tika 1.19.1

2018-10-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660625#comment-16660625 ] Markus Jelsma commented on NUTCH-2665: -- I'll commit this one later today, if i don't forget, unless

[jira] [Updated] (NUTCH-2665) Upgrade to Apache Tika 1.19.1

2018-10-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2665: - Attachment: NUTCH-2665.patch > Upgrade to Apache Tika 1.1
