[jira] [Commented] (NUTCH-2665) Upgrade to Apache Tika 1.19.1

2018-10-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660525#comment-16660525 ] Markus Jelsma commented on NUTCH-2665: -- Updated patch defining the property in ivysettings.xml

[jira] [Commented] (NUTCH-2665) Upgrade to Apache Tika 1.19.1

2018-10-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660455#comment-16660455 ] Markus Jelsma commented on NUTCH-2665: -- Patch for 2.x! > Upgrade to Apache Tika 1.1

[jira] [Updated] (NUTCH-2665) Upgrade to Apache Tika 1.19.1

2018-10-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2665: - Attachment: NUTCH-2665.patch > Upgrade to Apache Tika 1.1

[jira] [Created] (NUTCH-2665) Upgrade to Apache Tika 1.19.1

2018-10-23 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2665: Summary: Upgrade to Apache Tika 1.19.1 Key: NUTCH-2665 URL: https://issues.apache.org/jira/browse/NUTCH-2665 Project: Nutch Issue Type: Task

[jira] [Commented] (NUTCH-2651) Upgrade to Tika 1.19.1 (from 1.18)

2018-10-20 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658018#comment-16658018 ] Markus Jelsma commented on NUTCH-2651: -- [~wastl-nagel] i can feel the sorrow. I was just about

[jira] [Commented] (NUTCH-2651) Upgrade to Tika 1.19.1 (from 1.18)

2018-10-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655133#comment-16655133 ] Markus Jelsma commented on NUTCH-2651: -- +1 also thanks for finding the javax-ws fix, i could

[jira] [Commented] (NUTCH-2625) ProtocolFactory.getProtocol(url) may create multiple plugin instances

2018-10-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650282#comment-16650282 ] Markus Jelsma commented on NUTCH-2625: -- Seems reasonable, +1 > ProtocolFactory.getProtocol(url)

[jira] [Commented] (NUTCH-2186) -addBinaryContent flag can cause "String length must be a multiple of four" error in IndexingJob

2018-10-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16646261#comment-16646261 ] Markus Jelsma commented on NUTCH-2186: -- [~asm123] please open a new ticket > -addBinaryContent f

[jira] [Commented] (NUTCH-2192) Get rid of oro

2018-10-09 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643927#comment-16643927 ] Markus Jelsma commented on NUTCH-2192: -- Nice! I completely forgot these ancient issues. Thanks

[jira] [Commented] (NUTCH-2648) Make configurable whether TLS/SSL certificates are checked by protocol plugins

2018-10-09 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16642971#comment-16642971 ] Markus Jelsma commented on NUTCH-2648: -- I misread the patch regarding the other plugins. So +1

[jira] [Commented] (NUTCH-2648) Make configurable whether TLS/SSL certificates are checked by protocol plugins

2018-10-08 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16642463#comment-16642463 ] Markus Jelsma commented on NUTCH-2648: -- +1! Although i would suggest to mention it works only

[jira] [Resolved] (NUTCH-2647) Skip TLS certificate checks in protocol-http plugin

2018-09-28 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2647. -- Resolution: Fixed Committed To https://gitbox.apache.org/repos/asf/nutch.git 9d59538c

[jira] [Commented] (NUTCH-2647) Skip TLS certificate checks in protocol-http plugin

2018-09-28 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631599#comment-16631599 ] Markus Jelsma commented on NUTCH-2647: -- To confirm, protocol-httpclient also by default ignores self

[jira] [Commented] (NUTCH-2647) Skip TLS certificate checks in protocol-http plugin

2018-09-27 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631038#comment-16631038 ] Markus Jelsma commented on NUTCH-2647: -- Hello Sebastian, My own implementation of X509TrustManager

[jira] [Commented] (NUTCH-2623) Fetcher to guarantee delay for same host/domain/ip independent of http/https protocol

2018-09-27 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631040#comment-16631040 ] Markus Jelsma commented on NUTCH-2623: -- +1! Thanks Sebastian! > Fetcher to guarantee de

[jira] [Updated] (NUTCH-2647) Skip TLS certificate checks in protocol-http

2018-09-27 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2647: - Summary: Skip TLS certificate checks in protocol-http (was: Support for dummy X509 trust

[jira] [Updated] (NUTCH-2647) Skip TLS certificate checks in protocol-http plugin

2018-09-27 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2647: - Summary: Skip TLS certificate checks in protocol-http plugin (was: Skip TLS certificate checks

[jira] [Commented] (NUTCH-2647) Support for dummy X509 trust manager

2018-09-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620448#comment-16620448 ] Markus Jelsma commented on NUTCH-2647: -- patch for 1.15 source > Support for dummy X509 tr

[jira] [Updated] (NUTCH-2647) Support for dummy X509 trust manager

2018-09-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2647: - Attachment: NUTCH-2647.patch > Support for dummy X509 trust mana

[jira] [Created] (NUTCH-2647) Support for dummy X509 trust manager

2018-09-19 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2647: Summary: Support for dummy X509 trust manager Key: NUTCH-2647 URL: https://issues.apache.org/jira/browse/NUTCH-2647 Project: Nutch Issue Type: Improvement

[jira] [Commented] (NUTCH-2623) Fetcher to guarantee delay for same host/domain/ip independent of http/https protocol

2018-09-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613322#comment-16613322 ] Markus Jelsma commented on NUTCH-2623: -- +1, however, i would not have expected a byHostProtocol

[jira] [Created] (NUTCH-2630) Fetcher to log skipped records by robots.txt

2018-08-01 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2630: Summary: Fetcher to log skipped records by robots.txt Key: NUTCH-2630 URL: https://issues.apache.org/jira/browse/NUTCH-2630 Project: Nutch Issue Type

RE: [VOTE] Release Apache Nutch 1.15 RC#1

2018-08-01 Thread Markus Jelsma
However, the test crawl ran/runs fine, in the background, no errors. But just now, watching the fetcher, i noticed the crawl delay is not always respected. The only configuration change i have is the http.agent.* directives to run. 2018-08-01 11:47:41,256 INFO  fetcher.FetcherThread -

RE: [VOTE] Release Apache Nutch 1.15 RC#1

2018-08-01 Thread Markus Jelsma
All tests pass, crawler runs fine so far, +1 for 1.15! Regards, Markus -Original message- > From:Sebastian Nagel > Sent: Thursday 26th July 2018 17:05 > To: u...@nutch.apache.org > Cc: dev@nutch.apache.org > Subject: [VOTE] Release Apache Nutch 1.15 RC#1 > > Hi Folks, > > A first

[jira] [Comment Edited] (NUTCH-2612) Support for sitemap processing by hostname

2018-07-24 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554232#comment-16554232 ] Markus Jelsma edited comment on NUTCH-2612 at 7/24/18 1:24 PM: --- Updated

[jira] [Commented] (NUTCH-2612) Support for sitemap processing by hostname

2018-07-24 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554232#comment-16554232 ] Markus Jelsma commented on NUTCH-2612: -- Updated patch: * logging when a hostname is processed

[jira] [Commented] (NUTCH-2612) Support for sitemap processing by hostname

2018-07-04 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532712#comment-16532712 ] Markus Jelsma commented on NUTCH-2612: -- New patch! > Support for sitemap processing by hostn

[jira] [Updated] (NUTCH-2612) Support for sitemap processing by hostname

2018-07-04 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2612: - Attachment: NUTCH-2612.patch > Support for sitemap processing by hostn

[jira] [Commented] (NUTCH-2614) NPE in CrawlDbReader

2018-07-04 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532699#comment-16532699 ] Markus Jelsma commented on NUTCH-2614: -- Yes! > NPE in CrawlDbRea

[jira] [Commented] (NUTCH-2612) Support for sitemap processing by hostname

2018-07-04 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532691#comment-16532691 ] Markus Jelsma commented on NUTCH-2612: -- Yes of course! Will upload new patch! > Support for site

[jira] [Comment Edited] (NUTCH-2614) NPE in CrawlDbReader

2018-07-04 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532446#comment-16532446 ] Markus Jelsma edited comment on NUTCH-2614 at 7/4/18 9:26 AM: -- -Really

[jira] [Commented] (NUTCH-2614) NPE in CrawlDbReader

2018-07-04 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532446#comment-16532446 ] Markus Jelsma commented on NUTCH-2614: -- Really? In that case my patch for NUTCH-2612 is probably

[jira] [Created] (NUTCH-2614) NPE in CrawlDbReader

2018-07-03 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2614: Summary: NPE in CrawlDbReader Key: NUTCH-2614 URL: https://issues.apache.org/jira/browse/NUTCH-2614 Project: Nutch Issue Type: Bug Components

[jira] [Updated] (NUTCH-2612) Support for sitemap processing by hostname

2018-07-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2612: - Attachment: NUTCH-2612.patch > Support for sitemap processing by hostn

[jira] [Commented] (NUTCH-2612) Support for sitemap processing by hostname

2018-07-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531253#comment-16531253 ] Markus Jelsma commented on NUTCH-2612: -- Patch for master! > Support for sitemap process

[jira] [Created] (NUTCH-2612) Support for sitemap processing by hostname

2018-06-26 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2612: Summary: Support for sitemap processing by hostname Key: NUTCH-2612 URL: https://issues.apache.org/jira/browse/NUTCH-2612 Project: Nutch Issue Type

[jira] [Commented] (NUTCH-2606) MIME detection is wrong for plain-text documents send as Content-Type "application/msword"

2018-06-20 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517997#comment-16517997 ] Markus Jelsma commented on NUTCH-2606: -- Ah, this is interesting. Nutch indeed believes it is a Word

[jira] [Updated] (NUTCH-2597) NPE in updatehostdb

2018-06-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2597: - Description: I get an NPE on updatehostdb. I start with a clean crawlDB & hostDB. A

RE: Nutch 1.14 issues

2018-06-13 Thread Markus Jelsma
Ah, wrong thread. But it seems some things are not entirely right for 1.15 release just yet. Markus -Original message- > From:Markus Jelsma > Sent: Wednesday 13th June 2018 12:44 > To: dev@nutch.apache.org > Subject: RE: Nutch 1.14 issues > > Hi, > > I've got some tests failing

RE: Nutch 1.14 issues

2018-06-13 Thread Markus Jelsma
Hi, I've got some tests failing here on a vanilla master check out. [junit] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.314 sec [junit] Test org.apache.nutch.net.TestURLNormalizers FAILED Jurian had protocol-http's test failing just now, but running ant test on my

[jira] [Commented] (NUTCH-2416) Fetcher to log thread ID

2018-06-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503049#comment-16503049 ] Markus Jelsma commented on NUTCH-2416: -- Thanks! > Fetcher to log thread

[jira] [Closed] (NUTCH-2416) Fetcher to log thread ID

2018-06-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2416. > Fetcher to log thread ID > > > Key

[jira] [Created] (NUTCH-2585) NPE in TrieStringMatcher

2018-05-25 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2585: Summary: NPE in TrieStringMatcher Key: NUTCH-2585 URL: https://issues.apache.org/jira/browse/NUTCH-2585 Project: Nutch Issue Type: Bug Affects Versions

[jira] [Commented] (NUTCH-2573) Suspend crawling if robots.txt fails to fetch with 5xx status

2018-04-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454287#comment-16454287 ] Markus Jelsma commented on NUTCH-2573: -- Sounds like a good idea! > Suspend crawling if robots.

[jira] [Commented] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher

2018-04-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453826#comment-16453826 ] Markus Jelsma commented on NUTCH-1228: -- Wow, this is ancient! Thanks! > Change mapred.task.time

[jira] [Commented] (NUTCH-2572) HostDb: updatehostdb does not set values

2018-04-24 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449722#comment-16449722 ] Markus Jelsma commented on NUTCH-2572: -- +1 > HostDb: updatehostdb does not set val

[jira] [Commented] (NUTCH-2547) urlnormalizer-basic fails on special characters in path/query

2018-03-29 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419027#comment-16419027 ] Markus Jelsma commented on NUTCH-2547: -- Hello Sebastian, option two sounds fine. > urlnormali

[jira] [Comment Edited] (NUTCH-2541) Arabic characters in the URL path are not properly escaped by the protocol-httpclient plugin

2018-03-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407981#comment-16407981 ] Markus Jelsma edited comment on NUTCH-2541 at 3/21/18 3:17 PM

[jira] [Commented] (NUTCH-2541) Arabic characters in the URL path are not properly escaped by the protocol-httpclient plugin

2018-03-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407981#comment-16407981 ] Markus Jelsma commented on NUTCH-2541: -- This is probably not a 1.14 problem, we fixed it some

[jira] [Resolved] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2018-03-08 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2411. -- Resolution: Fixed Committed for 1.15 bd70d2fe..9a77f437 master -> master > Index-me

[jira] [Commented] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2018-03-08 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391141#comment-16391141 ] Markus Jelsma commented on NUTCH-2411: -- Forgot the last time i threatened to commit, will try again

[jira] [Commented] (NUTCH-2525) Metadata indexer cannot handle uppercase parse metadata

2018-03-08 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391139#comment-16391139 ] Markus Jelsma commented on NUTCH-2525: -- Any comments on this one? Julien did the initial work

[jira] [Updated] (NUTCH-2525) Metadata indexer cannot handle uppercase parse metadata

2018-03-07 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2525: - Attachment: NUTCH-2525.patch > Metadata indexer cannot handle uppercase parse metad

[jira] [Created] (NUTCH-2525) Metadata indexer cannot handle uppercase parse metadata

2018-03-07 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2525: Summary: Metadata indexer cannot handle uppercase parse metadata Key: NUTCH-2525 URL: https://issues.apache.org/jira/browse/NUTCH-2525 Project: Nutch Issue

RE: Nutch fails to compile...

2018-02-21 Thread Markus Jelsma
silly due to it being late > > On Wed, Feb 21, 2018 at 1:37 AM, BlackIce <blackice...@gmail.com > <mailto:blackice...@gmail.com>> wrote: > I commented out the date and now after a whole lot of warnings it says Build > Successful > > Im gonna take it for a short

RE: Nutch fails to compile...

2018-02-20 Thread Markus Jelsma
Hello, Well, this is interesting! Have you tried Java 8 instead? I don't think 9 should cause these kinds of problems but i haven't tried it yet, but would like to know anyway. Regarding commenting out the date, try it anyway! Regards, Markus -Original message- > From:BlackIce

[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347762#comment-16347762 ] Markus Jelsma commented on NUTCH-2466: -- Another note, curious to see browser developers allow over

[jira] [Comment Edited] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347762#comment-16347762 ] Markus Jelsma edited comment on NUTCH-2466 at 1/31/18 11:14 PM: Another

[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347749#comment-16347749 ] Markus Jelsma commented on NUTCH-2466: -- Glad to hear this will work for you! > Sitemap proces

[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347735#comment-16347735 ] Markus Jelsma commented on NUTCH-2466: -- Hello Moreno, Well, we obviously could allow a -1 setting

[jira] [Resolved] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2466. -- Resolution: Fixed > Sitemap processor to follow redire

[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346862#comment-16346862 ] Markus Jelsma commented on NUTCH-2466: -- Thanks! remote: Sending notification emails to: ['"

[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346768#comment-16346768 ] Markus Jelsma commented on NUTCH-2466: -- New patch! > Sitemap processor to follow redire

[jira] [Updated] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2466: - Attachment: NUTCH-2466.patch > Sitemap processor to follow redire

[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346730#comment-16346730 ] Markus Jelsma commented on NUTCH-2466: -- Will commit shortly unless objections. > Sitemap proces

[jira] [Commented] (NUTCH-2369) Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph

2018-01-24 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338290#comment-16338290 ] Markus Jelsma commented on NUTCH-2369: -- How is this different from the current WebGraph package which

[jira] [Commented] (NUTCH-2503) Add option to run tests for a single plugin

2018-01-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335984#comment-16335984 ] Markus Jelsma commented on NUTCH-2503: -- Hmm, in the past you could run ant -f src/plugin/urlfilter

[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335949#comment-16335949 ] Markus Jelsma commented on NUTCH-2466: -- First patch adding maxRedir configurable and filterNormalize

[jira] [Updated] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2466: - Attachment: NUTCH-2466.patch > Sitemap processor to follow redire

[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16328892#comment-16328892 ] Markus Jelsma commented on NUTCH-2466: -- Ah, crap yeah. Won't get back to this today. Hopefully later

[jira] [Commented] (NUTCH-2496) Speed up link inversion step in crawling script

2018-01-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326999#comment-16326999 ] Markus Jelsma commented on NUTCH-2496: -- Yes it makes a lot of sense to disable it everywhere except

[jira] [Comment Edited] (NUTCH-2496) Speed up link inversion step in crawling script

2018-01-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326999#comment-16326999 ] Markus Jelsma edited comment on NUTCH-2496 at 1/16/18 10:52 AM: Yes

[jira] [Commented] (NUTCH-2496) Speed up link inversion step in crawling script

2018-01-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16325053#comment-16325053 ] Markus Jelsma commented on NUTCH-2496: -- If you use the same filters/normalizers everywhere in Nutch

[jira] [Commented] (NUTCH-2487) Fetcher thread stopped due to constraint violation

2017-12-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16303688#comment-16303688 ] Markus Jelsma commented on NUTCH-2487: -- It seems Nutch and your plugin are using a different version

RE: [VOTE] Release Apache Nutch 1.14 RC#1

2017-12-19 Thread Markus Jelsma
the office. > > I'm also against mentioning the open issues in the release notes, it's normal > to have open/unresolved issues before a release and we should focus only on > mentioning what was added/fixed, for the remaining issues we already have > Jira (which is public). >

RE: [VOTE] Release Apache Nutch 1.14 RC#1

2017-12-19 Thread Markus Jelsma
I do not agree on mentioning those issues as unresolved in the release notes. They are known open issues, just as many others are known and open issues. There is no reason to mention these specific issues and not mentioning all the other open issues. Otherwise +1; Thanks Sebastian!

[jira] [Updated] (NUTCH-2485) ParserFactory swallows exception

2017-12-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2485: - Attachment: NUTCH-2485.patch Patch! > ParserFactory swallows except

[jira] [Created] (NUTCH-2485) ParserFactory swallows exception

2017-12-18 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2485: Summary: ParserFactory swallows exception Key: NUTCH-2485 URL: https://issues.apache.org/jira/browse/NUTCH-2485 Project: Nutch Issue Type: Bug Affects

[jira] [Commented] (NUTCH-2478) // is not a valid base URL

2017-12-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16294150#comment-16294150 ] Markus Jelsma commented on NUTCH-2478: -- Thanks! > // is not a valid base

[jira] [Resolved] (NUTCH-2320) URLFilterChecker to run as TCP Telnet service

2017-12-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2320. -- Resolution: Duplicate > URLFilterChecker to run as TCP Telnet serv

[jira] [Closed] (NUTCH-2338) URLNormalizerChecker to run as TCP Telnet service

2017-12-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2338. > URLNormalizerChecker to run as TCP Telnet serv

[jira] [Closed] (NUTCH-2478) // is not a valid base URL

2017-12-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2478. > // is not a valid base URL > -- > > Key

[jira] [Resolved] (NUTCH-2338) URLNormalizerChecker to run as TCP Telnet service

2017-12-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2338. -- Resolution: Duplicate > URLNormalizerChecker to run as TCP Telnet serv

[jira] [Commented] (NUTCH-2338) URLNormalizerChecker to run as TCP Telnet service

2017-12-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16294148#comment-16294148 ] Markus Jelsma commented on NUTCH-2338: -- Yes! > URLNormalizerChecker to run as TCP Telnet serv

[jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.17

2017-12-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292469#comment-16292469 ] Markus Jelsma commented on NUTCH-2439: -- Weird, i only got : Dec 15, 2017 1:45:42 PM

[jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.17

2017-12-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292421#comment-16292421 ] Markus Jelsma commented on NUTCH-2439: -- Note, since 1.17, all but one of the warnings are gone

[jira] [Commented] (NUTCH-2478) // is not a valid base URL

2017-12-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292419#comment-16292419 ] Markus Jelsma commented on NUTCH-2478: -- I prefer your patch, it also carries a test

[jira] [Commented] (NUTCH-2354) Upgrade Hadoop dependencies to 2.7.3

2017-12-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290955#comment-16290955 ] Markus Jelsma commented on NUTCH-2354: -- Yes, i think we should include this. > Upgrade Had

[jira] [Commented] (NUTCH-2474) CrawlDbReader -stats fails with ClassCastException

2017-12-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290957#comment-16290957 ] Markus Jelsma commented on NUTCH-2474: -- +1 > CrawlDbReader -stats fails with ClassCastExcept

[jira] [Commented] (NUTCH-2478) // is not a valid base URL

2017-12-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288300#comment-16288300 ] Markus Jelsma commented on NUTCH-2478: -- To clarify a bad sentence, i resolve the missing protocol

[jira] [Commented] (NUTCH-2478) // is not a valid base URL

2017-12-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288289#comment-16288289 ] Markus Jelsma commented on NUTCH-2478: -- Yes, this needs a change in the parser plugins. I sought

[jira] [Updated] (NUTCH-2478) // is not a valid base URL

2017-12-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2478: - Description: This test fails: {code} @Test public void testBadResolver() throws Exception

[jira] [Created] (NUTCH-2478) // is not a valid base URL

2017-12-12 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2478: Summary: // is not a valid base URL Key: NUTCH-2478 URL: https://issues.apache.org/jira/browse/NUTCH-2478 Project: Nutch Issue Type: Bug Affects

RE: [DISCUSS] Release 1.14?

2017-12-12 Thread Markus Jelsma
Happy to hear. There are major improvements in Tika 1.17, it deals much better with some of the more extravagant web pages you find on the web. -Original message- > From:Sebastian Nagel > Sent: Tuesday 12th December 2017 13:36 > To: dev@nutch.apache.org >

RE: [DISCUSS] Release 1.14?

2017-12-08 Thread Markus Jelsma
Yes, please do :) -Original message- > From:BlackIce > Sent: Friday 8th December 2017 23:57 > To: dev@nutch.apache.org > Subject: Re: [DISCUSS] Release 1.14? > > OK, Ill test the RC  > > On Dec 8, 2017 11:54 PM, "Sebastian Nagel"

[jira] [Commented] (NUTCH-2472) Sitemap processor does not honour db.ignore.external.links

2017-12-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16280285#comment-16280285 ] Markus Jelsma commented on NUTCH-2472: -- Yes probably. There are many sitemaps out there that link

[jira] [Commented] (NUTCH-2472) Sitemap processor does not honour db.ignore.external.links

2017-12-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16280179#comment-16280179 ] Markus Jelsma commented on NUTCH-2472: -- Oh crap, you are right. It happened to us yesterday, but now

[jira] [Closed] (NUTCH-2472) Sitemap processor does not honour db.ignore.external.links

2017-12-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2472. Resolution: Not A Problem > Sitemap processor does not honour db.ignore.external.li

[jira] [Created] (NUTCH-2472) Sitemap processor does not honour db.ignore.external.links

2017-12-06 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2472: Summary: Sitemap processor does not honour db.ignore.external.links Key: NUTCH-2472 URL: https://issues.apache.org/jira/browse/NUTCH-2472 Project: Nutch

[jira] [Commented] (NUTCH-2470) CrawlDbReader -stats to show quantiles of score

2017-12-04 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277448#comment-16277448 ] Markus Jelsma commented on NUTCH-2470: -- Ah, you are using t-digest, very nice library indeed. +1
