[jira] [Updated] (NUTCH-2335) Injector not to filter and normalize existing URLs in CrawlDb

2017-08-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2335: - Attachment: Injector.java > Injector not to filter and normalize existing URLs in Craw

[jira] [Commented] (NUTCH-2378) ChildFirst plugin classloader

2017-08-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127866#comment-16127866 ] Markus Jelsma commented on NUTCH-2378: -- Glad to see the solution for this ugly bastard get verified

[jira] [Resolved] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-07-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2368. -- Resolution: Fixed > Variable generate.max.count and fetcher.server.de

[jira] [Commented] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-07-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101558#comment-16101558 ] Markus Jelsma commented on NUTCH-2368: -- Committed to master in 2de30d2e..44f7ad97 master -> mas

[jira] [Commented] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-07-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101557#comment-16101557 ] Markus Jelsma commented on NUTCH-2368: -- Hahahaha sure! Thanks Sebastian! > Varia

[jira] [Updated] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-07-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2368: - Attachment: NUTCH-2368.patch Of course, good point. Final patch, if still something is wrong

[jira] [Updated] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-07-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2368: - Attachment: NUTCH-2368.patch Good point! Updated patch. > Variable generate.max.co

[jira] [Updated] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-07-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2368: - Attachment: NUTCH-2368.patch Good points! Updated patch! > Variable generate.max.co

[jira] [Updated] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-07-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2368: - Attachment: NUTCH-2368.patch Removed some files that didn't belong in the patch. > Varia

[jira] [Updated] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-07-20 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2368: - Attachment: NUTCH-2368.patch Updated patch to fix NUTCH-2404. > Variable generate.max.co

[jira] [Closed] (NUTCH-2402) Fetcher variable missing for generate.max.count.expr and fetcher.server.delay.expr

2017-07-20 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2402. Resolution: Not A Problem Oh crap, i thought i had already committed NUTCH-2368, apparently

[jira] [Created] (NUTCH-2402) Fetcher variable missing for generate.max.count.expr and fetcher.server.delay.expr

2017-07-20 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2402: Summary: Fetcher variable missing for generate.max.count.expr and fetcher.server.delay.expr Key: NUTCH-2402 URL: https://issues.apache.org/jira/browse/NUTCH-2402

[jira] [Resolved] (NUTCH-1465) Support sitemaps in Nutch

2017-07-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-1465. -- Resolution: Fixed remote:2dc7472..8f556f4 8f556f4a87d87edb96fb575fa4b579e39d9dfdb4

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092945#comment-16092945 ] Markus Jelsma commented on NUTCH-1465: -- Crap! I was probably looking without seeing! Got

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092909#comment-16092909 ] Markus Jelsma commented on NUTCH-1465: -- Sebastian, your patch has CrawlDatum

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090460#comment-16090460 ] Markus Jelsma commented on NUTCH-1465: -- Thanks! Will grab 202.patch and see if it fits tomorrow

[jira] [Assigned] (NUTCH-1465) Support sitemaps in Nutch

2017-07-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-1465: Assignee: Markus Jelsma (was: Lewis John McGibbney) > Support sitemaps in Nu

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-07 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16078124#comment-16078124 ] Markus Jelsma commented on NUTCH-1465: -- I think this is committable, anyone to disagree? If not, i'll

[jira] [Commented] (NUTCH-2397) Parser to add paragraph line breaks

2017-07-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16074740#comment-16074740 ] Markus Jelsma commented on NUTCH-2397: -- Thanks Sebastian! > Parser to add paragraph line bre

[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

2017-07-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1465: - Attachment: NUTCH-1465.patch Anyway, here's the patch with crawler-commons 0.8. > Supp

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16074554#comment-16074554 ] Markus Jelsma commented on NUTCH-1465: -- Hi Lewis, 0.8 doesn't deal with this sitemap at autotrader

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-04 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16073338#comment-16073338 ] Markus Jelsma commented on NUTCH-1465: -- Ah, i see. The autotrader sitemap points to an index

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-04 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16073288#comment-16073288 ] Markus Jelsma commented on NUTCH-1465: -- Hello Lewis, I am positive i took the latest pieces

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-06-30 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070485#comment-16070485 ] Markus Jelsma commented on NUTCH-1465: -- Hi Lewis! It appears to be working fine now and bug-free due

[jira] [Comment Edited] (NUTCH-1465) Support sitemaps in Nutch

2017-06-30 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070323#comment-16070323 ] Markus Jelsma edited comment on NUTCH-1465 at 6/30/17 3:58 PM: --- Ah, removing

[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

2017-06-30 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1465: - Attachment: NUTCH-1465.patch Ah, removing the NULL check in the reducer solves the problem

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-06-30 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070314#comment-16070314 ] Markus Jelsma commented on NUTCH-1465: -- There is an oddity going on when a sitemap.xml entry

[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

2017-06-30 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1465: - Attachment: NUTCH-1465.patch Updated patch: * corrected implementation for not overwriting

[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

2017-06-30 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1465: - Attachment: NUTCH-1465.patch Updated patch for trunk: * added some curly braces to if statements

[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2017-06-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062909#comment-16062909 ] Markus Jelsma commented on NUTCH-2184: -- Any progress on this one? > Enable IndexingJob to funct

RE: [ANNOUNCEMENT] Welcome Blackice as new Nutch PMC and Committer

2017-06-14 Thread Markus Jelsma
Welcome! -Original message- > From:lewis john mcgibbney > Sent: Wednesday 14th June 2017 16:42 > To: dev@nutch.apache.org; u...@nutch.apache.org > Subject: [ANNOUNCEMENT] Welcome Blackice as new Nutch PMC and Committer > > Hi Folks, > The Nutch PMC recently VOTEd

[jira] [Updated] (NUTCH-2386) BasicURLNormalizer does not encode curly braces

2017-05-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2386: - Attachment: NUTCH-2386.patch Patch for trunk > BasicURLNormalizer does not encode curly bra

[jira] [Created] (NUTCH-2386) BasicURLNormalizer does not encode curly braces

2017-05-15 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2386: Summary: BasicURLNormalizer does not encode curly braces Key: NUTCH-2386 URL: https://issues.apache.org/jira/browse/NUTCH-2386 Project: Nutch Issue Type

[jira] [Updated] (NUTCH-2382) indexer-hbase Nutch 1.x branch

2017-05-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2382: - Fix Version/s: 1.14 > indexer-hbase Nutch 1.x bra

[jira] [Updated] (NUTCH-2380) indexer-elastic version bump

2017-05-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2380: - Fix Version/s: 1.14 > indexer-elastic version b

[jira] [Updated] (NUTCH-2378) ChildFirst plugin classloader

2017-05-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2378: - Fix Version/s: 1.14 > ChildFirst plugin classloa

[jira] [Commented] (NUTCH-2378) ChildFirst plugin classloader

2017-05-01 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15991597#comment-15991597 ] Markus Jelsma commented on NUTCH-2378: -- Patch passes all tests in a clean Nutch master checkout

RE: Nutch git/wiki

2017-05-01 Thread Markus Jelsma
Thanks! -Original message- > From:Daniel Gruno <humbed...@apache.org> > Sent: Monday 1st May 2017 23:25 > To: Markus Jelsma <markus.jel...@openindex.io>; Daniel Gruno > <humbed...@apache.org>; dev@nutch.apache.org > Cc: infrastruct...@apache.

RE: Nutch git/wiki

2017-05-01 Thread Markus Jelsma
-Original message- > From:Daniel Gruno <humbed...@apache.org> > Sent: Monday 1st May 2017 23:11 > To: Markus Jelsma <markus.jel...@openindex.io>; dev@nutch.apache.org > Cc: infrastruct...@apache.org > Subject: Re: Nutch git/wiki > > On 05/01/2017 11:08 PM, Mar

Nutch git/wiki

2017-05-01 Thread Markus Jelsma
Hello, Nutch's wiki on git doesn't appear to work; the repo cannot be found when checking out. Using the ASF committer page and substituting the project name with nutch also doesn't find a repo. Any ideas? Thanks, Markus

[jira] [Commented] (NUTCH-1932) Automatically remove orphaned pages

2017-04-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967410#comment-15967410 ] Markus Jelsma commented on NUTCH-1932: -- Probably, what do you suggest? > Automatically rem

[jira] [Commented] (NUTCH-2335) Injector not to filter and normalize existing URLs in CrawlDb

2017-04-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967406#comment-15967406 ] Markus Jelsma commented on NUTCH-2335: -- I can't see it too, it doesn't match with my compiled sources

[jira] [Commented] (NUTCH-1932) Automatically remove orphaned pages

2017-04-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964262#comment-15964262 ] Markus Jelsma commented on NUTCH-1932: -- We have this latest patch running for over six months now

[jira] [Commented] (NUTCH-2335) Injector not to filter and normalize existing URLs in CrawlDb

2017-04-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964224#comment-15964224 ] Markus Jelsma commented on NUTCH-2335: -- Yes, it still filters/normalizes. Although it is not obvious

[jira] [Commented] (NUTCH-2335) Injector not to filter and normalize existing URLs in CrawlDb

2017-04-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964205#comment-15964205 ] Markus Jelsma commented on NUTCH-2335: -- Don't know how i changed the assignee. Anyway, our injector

[jira] [Assigned] (NUTCH-2335) Injector not to filter and normalize existing URLs in CrawlDb

2017-04-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-2335: Assignee: Sebastian Nagel (was: Markus Jelsma) > Injector not to filter and normal

[jira] [Assigned] (NUTCH-2335) Injector not to filter and normalize existing URLs in CrawlDb

2017-04-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-2335: Assignee: Markus Jelsma (was: Sebastian Nagel) > Injector not to filter and normal

[jira] [Commented] (NUTCH-2335) Injector not to filter and normalize existing URLs in CrawlDb

2017-04-10 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962762#comment-15962762 ] Markus Jelsma commented on NUTCH-2335: -- Sebastian, filtering and normalizing is still enabled

[jira] [Commented] (NUTCH-2335) Injector not to filter and normalize existing URLs in CrawlDb

2017-04-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15958941#comment-15958941 ] Markus Jelsma commented on NUTCH-2335: -- Thanks! > Injector not to filter and normalize existing U

[jira] [Commented] (NUTCH-2335) Injector not to filter and normalize existing URLs in CrawlDb

2017-04-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15958906#comment-15958906 ] Markus Jelsma commented on NUTCH-2335: -- Ah, making it a patch file: https://github.com/apache/nutch

[jira] [Commented] (NUTCH-2335) Injector not to filter and normalize existing URLs in CrawlDb

2017-04-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15958893#comment-15958893 ] Markus Jelsma commented on NUTCH-2335: -- Ah, i didn't see this one. Anyone knows how i can turn a pull

[jira] [Closed] (NUTCH-2371) Injector to support noFilter and noNormalize

2017-04-06 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2371. Resolution: Duplicate > Injector to support noFilter and noNormal

[jira] [Created] (NUTCH-2371) Injector to support noFilter and noNormalize

2017-04-06 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2371: Summary: Injector to support noFilter and noNormalize Key: NUTCH-2371 URL: https://issues.apache.org/jira/browse/NUTCH-2371 Project: Nutch Issue Type: Bug

RE: [VOTE] Release Apache Nutch 1.13 RC#1

2017-03-29 Thread Markus Jelsma
Looks good to me! +1 Thanks Lewis! Markus -Original message- > From:lewis john mcgibbney > Sent: Wednesday 29th March 2017 6:20 > To: u...@nutch.apache.org; dev@nutch.apache.org > Subject: [VOTE] Release Apache Nutch 1.13 RC#1 > > Hi Folks, > > A first candidate

[jira] [Commented] (NUTCH-2247) Protocol resolver

2017-03-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936750#comment-15936750 ] Markus Jelsma commented on NUTCH-2247: -- Ah i forgot about this thing. It is not ready indeed and we

[jira] [Commented] (NUTCH-2212) Decrease memory consumption by tuning stack size

2017-03-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936741#comment-15936741 ] Markus Jelsma commented on NUTCH-2212: -- Yes i agree. In local mode it is usually not a problem

RE: GSOC2017: Anybody is mentoring and is interested in improving Solr integration

2017-03-21 Thread Markus Jelsma
ces for Solr users, new and experienced > > > On 21 March 2017 at 16:21, Markus Jelsma <markus.jel...@openindex.io> wrote: > > Hello Alexandre - we already have a Solr indexing plugin. There are > > probably some bugs but we are happily indexing from Nutch to SolrCloud. &

RE: GSOC2017: Anybody is mentoring and is interested in improving Solr integration

2017-03-21 Thread Markus Jelsma
And in case you are already aware of it, do you have new ideas of further Nutch/Solr integration? That would be interesting. In case of using Solr as a datasource, i believe Nutch 2.x via Apache Gora already integrates with it. Regards, Markus -Original message- > From:Markus

RE: GSOC2017: Anybody is mentoring and is interested in improving Solr integration

2017-03-21 Thread Markus Jelsma
Hello Alexandre - we already have a Solr indexing plugin. There are probably some bugs but we are happily indexing from Nutch to SolrCloud. Markus -Original message- > From:Alexandre Rafalovitch > Sent: Tuesday 21st March 2017 19:14 > To: dev@nutch.apache.org >

[jira] [Updated] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-03-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2368: - Attachment: NUTCH-2368.patch Updated patch. Delay is not also set on minCrawlDelay to make

[jira] [Updated] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-03-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2368: - Description: In some cases we need to use host specific characteristics in determining crawl

[jira] [Resolved] (NUTCH-2068) Allow subcollection overrides via metadata

2017-03-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2068. -- Resolution: Fixed Committed to be3aea1..9fb7d6c master -> master > Allow subcoll

[jira] [Commented] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-03-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927821#comment-15927821 ] Markus Jelsma commented on NUTCH-2368: -- Any thought on this patch? > Variable generate.max.co

[jira] [Resolved] (NUTCH-2367) Get single record from HostDB

2017-03-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2367. -- Resolution: Fixed Committed to 3926910..be3aea1 master -> master > Get single recor

[jira] [Closed] (NUTCH-2366) Deprecated Job constructor in hostdb/ReadHostDb.java

2017-03-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2366. > Deprecated Job constructor in hostdb/ReadHostDb.j

[jira] [Commented] (NUTCH-2068) Allow subcollection overrides via metadata

2017-03-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926027#comment-15926027 ] Markus Jelsma commented on NUTCH-2068: -- Found this little thing when sifting through some patches

[jira] [Commented] (NUTCH-2367) Get single record from HostDB

2017-03-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926024#comment-15926024 ] Markus Jelsma commented on NUTCH-2367: -- Will commit shortly unless objections > Get single rec

[jira] [Resolved] (NUTCH-2366) Deprecated Job constructor in hostdb/ReadHostDb.java

2017-03-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2366. -- Resolution: Fixed Committed to 6d47e14..3926910 master -> master Thanks Omkar!! > Depr

[jira] [Assigned] (NUTCH-2068) Allow subcollection overrides via metadata

2017-03-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-2068: Assignee: Markus Jelsma > Allow subcollection overrides via metad

[jira] [Updated] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-03-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2368: - Attachment: NUTCH-2368.patch Now this is odd, had to make this change but had it running

[jira] [Updated] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-03-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2368: - Attachment: NUTCH-2368.patch New patch. Removed system.out > Variable generate.max.co

[jira] [Created] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-03-14 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2368: Summary: Variable generate.max.count and fetcher.server.delay Key: NUTCH-2368 URL: https://issues.apache.org/jira/browse/NUTCH-2368 Project: Nutch Issue

[jira] [Updated] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-03-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2368: - Attachment: NUTCH-2368.patch Patch for trunk! > Variable generate.max.co

[jira] [Commented] (NUTCH-2363) Fetcher support for reading and setting cookies

2017-03-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15924523#comment-15924523 ] Markus Jelsma commented on NUTCH-2363: -- Thanks Sebastian - i will address your remarks later

[jira] [Created] (NUTCH-2367) Get single record from HostDB

2017-03-14 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2367: Summary: Get single record from HostDB Key: NUTCH-2367 URL: https://issues.apache.org/jira/browse/NUTCH-2367 Project: Nutch Issue Type: Improvement

[jira] [Updated] (NUTCH-2367) Get single record from HostDB

2017-03-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2367: - Attachment: NUTCH-2367.patch Patch for trunk! > Get single record from Hos

[jira] [Assigned] (NUTCH-2366) Deprecated Job constructor in hostdb/ReadHostDb.java

2017-03-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-2366: Assignee: Markus Jelsma > Deprecated Job constructor in hostdb/ReadHostDb.j

[jira] [Updated] (NUTCH-2366) Deprecated Job constructor in hostdb/ReadHostDb.java

2017-03-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2366: - Affects Version/s: (was: 2.2.1) > Deprecated Job constructor in hostdb/ReadHostDb.j

[jira] [Updated] (NUTCH-2366) Deprecated Job constructor in hostdb/ReadHostDb.java

2017-03-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2366: - Fix Version/s: 1.13 > Deprecated Job constructor in hostdb/ReadHostDb.j

[jira] [Commented] (NUTCH-2363) Fetcher support for reading and setting cookies

2017-03-01 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889994#comment-15889994 ] Markus Jelsma commented on NUTCH-2363: -- Hello Julien, NUTCH-2355 dealt with sending cookies

[jira] [Updated] (NUTCH-2363) Fetcher support for reading and setting cookies

2017-02-28 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2363: - Attachment: NUTCH-2363.patch Patch for trunk! > Fetcher support for reading and setting cook

[jira] [Created] (NUTCH-2363) Fetcher support for reading and setting cookies

2017-02-28 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2363: Summary: Fetcher support for reading and setting cookies Key: NUTCH-2363 URL: https://issues.apache.org/jira/browse/NUTCH-2363 Project: Nutch Issue Type

[jira] [Resolved] (NUTCH-2355) Protocol plugins to set cookie if Cookie metadata field is present

2017-02-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2355. -- Resolution: Fixed remote: nutch git commit: NUTCH-2355 Protocol plugins to set cookie if Cookie

[jira] [Commented] (NUTCH-2355) Protocol plugins to set cookie if Cookie metadata field is present

2017-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867779#comment-15867779 ] Markus Jelsma commented on NUTCH-2355: -- I'll commit this one soon unless objections. > Proto

[jira] [Closed] (NUTCH-2359) Parsefilter-regex raises IndexOutOfBoundsException when rules are ill-formed

2017-02-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2359. > Parsefilter-regex raises IndexOutOfBoundsException when rules are ill-for

[jira] [Resolved] (NUTCH-2359) Parsefilter-regex raises IndexOutOfBoundsException when rules are ill-formed

2017-02-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2359. -- Resolution: Fixed Assignee: Markus Jelsma Committed to master in 76aedcb..9a9c4b3. Thank

[jira] [Commented] (NUTCH-2359) Parsefilter-regex raises IndexOutOfBoundsException when rules are ill-formed

2017-02-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864360#comment-15864360 ] Markus Jelsma commented on NUTCH-2359: -- Hello [~laknath] - the label patch is attached

[jira] [Updated] (NUTCH-2359) Parsefilter-regex raises IndexOutOfBoundsException when rules are ill-formed

2017-02-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2359: - Fix Version/s: 1.13 > Parsefilter-regex raises IndexOutOfBoundsException when rules are

[jira] [Resolved] (NUTCH-2229) Allow Jexl expressions on CrawlDatum's fixed attributes

2017-02-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2229. -- Resolution: Fixed Well, it seems i am working in the wrong directory! Sorry for the fuss

[jira] [Closed] (NUTCH-2229) Allow Jexl expressions on CrawlDatum's fixed attributes

2017-02-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2229. > Allow Jexl expressions on CrawlDatum's fixed attribu

[jira] [Updated] (NUTCH-2229) Allow Jexl expressions on CrawlDatum's fixed attributes

2017-02-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2229: - Fix Version/s: (was: 1.13) 1.12 > Allow Jexl expressions on CrawlDatu

[jira] [Comment Edited] (NUTCH-2229) Allow Jexl expressions on CrawlDatum's fixed attributes

2017-02-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851321#comment-15851321 ] Markus Jelsma edited comment on NUTCH-2229 at 2/3/17 10:55 AM: --- This commit

[jira] [Updated] (NUTCH-2229) Allow Jexl expressions on CrawlDatum's fixed attributes

2017-02-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2229: - Fix Version/s: (was: 1.12) 1.13 > Allow Jexl expressions on CrawlDatu

[jira] [Reopened] (NUTCH-2229) Allow Jexl expressions on CrawlDatum's fixed attributes

2017-02-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reopened NUTCH-2229: -- This commit seems to have disappeared from git master. Although the SVN diff is very clear

[jira] [Commented] (NUTCH-2349) urlnormalizer-basic NPE for ill-formed URL "http:/"

2017-02-01 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848358#comment-15848358 ] Markus Jelsma commented on NUTCH-2349: -- Thanks! > urlnormalizer-basic NPE for ill-formed URL &q

[jira] [Commented] (NUTCH-2355) Protocol plugins to set cookie if Cookie metadata field is present

2017-02-01 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848339#comment-15848339 ] Markus Jelsma commented on NUTCH-2355: -- Hello Sebastian, # right now we can only pass the Cookie via

[jira] (NUTCH-2355) Protocol plugins to set cookie if Cookie metadata field is present

2017-01-31 Thread Markus Jelsma (JIRA)
Markus Jelsma updated an issue

[jira] (NUTCH-2355) Protocol plugins to set cookie if Cookie metadata field is present

2017-01-31 Thread Markus Jelsma (JIRA)
Markus Jelsma created an issue

[jira] [Updated] (NUTCH-2354) Upgrade Hadoop dependencies to 2.7.3

2017-01-20 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2354: - Description: This Wednesday we experienced trouble running the 1.12 injector on Hadoop 2.7.3. We

[jira] [Created] (NUTCH-2354) Upgrade Hadoop dependencies to 2.7.3

2017-01-20 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2354: Summary: Upgrade Hadoop dependencies to 2.7.3 Key: NUTCH-2354 URL: https://issues.apache.org/jira/browse/NUTCH-2354 Project: Nutch Issue Type: Bug
