[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Patch Info: Patch Available > Automatically remove orphaned pa

[jira] [Commented] (NUTCH-2178) DeduplicationJob to optionall group on host or domain

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083096#comment-15083096 ] Markus Jelsma commented on NUTCH-2178: -- Will commit in a few if no further objections

[jira] [Updated] (NUTCH-1449) Optionally delete documents skipped by IndexingFilters

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1449: - Patch Info: Patch Available > Optionally delete documents skipped by IndexingFilt

[jira] [Updated] (NUTCH-1186) FreeGenerator always normalizes

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1186: - Patch Info: Patch Available > FreeGenerator always normali

[jira] [Updated] (NUTCH-2191) Add protocol-htmlunit

2015-12-24 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2191: - Attachment: NUTCH-2191.patch Patch for trunk! Although all dependencies are correctly listed

[jira] [Created] (NUTCH-2191) Add protocol-htmlunit

2015-12-24 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2191: Summary: Add protocol-htmlunit Key: NUTCH-2191 URL: https://issues.apache.org/jira/browse/NUTCH-2191 Project: Nutch Issue Type: New Feature

[jira] [Created] (NUTCH-2192) Get rid of oro

2015-12-24 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2192: Summary: Get rid of oro Key: NUTCH-2192 URL: https://issues.apache.org/jira/browse/NUTCH-2192 Project: Nutch Issue Type: Task Reporter: Markus

[jira] [Updated] (NUTCH-2192) Get rid of oro

2015-12-24 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2192: - Attachment: NUTCH-2192.patch Patch for trunk. OutlinkExtractor is done. JsParsefilter left

[jira] [Resolved] (NUTCH-2189) Domain filter must deactivate if no rules are present

2015-12-24 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2189. -- Resolution: Fixed Committed to trunk in revision 1721615. Also updated CHANGES.txt for NUTCH

[jira] [Closed] (NUTCH-2189) Domain filter must deactivate if no rules are present

2015-12-24 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2189. > Domain filter must deactivate if no rules are pres

[jira] [Updated] (NUTCH-2190) Protocol normalizer

2015-12-24 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2190: - Attachment: NUTCH-2190.patch Patch for trunk. Tests pass. {code} # format: host\tprotocol\n

[jira] [Commented] (NUTCH-2065) Domain URL filter to support protocols

2015-12-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069421#comment-15069421 ] Markus Jelsma commented on NUTCH-2065: -- Agreed! > Domain URL filter to support protoc

[jira] [Closed] (NUTCH-2065) Domain URL filter to support protocols

2015-12-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2065. Resolution: Won't Fix > Domain URL filter to support protoc

[jira] [Created] (NUTCH-2190) Protocol normalizer

2015-12-23 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2190: Summary: Protocol normalizer Key: NUTCH-2190 URL: https://issues.apache.org/jira/browse/NUTCH-2190 Project: Nutch Issue Type: New Feature

[jira] [Commented] (NUTCH-2189) Domain filter must deactivate if no rules are present

2015-12-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069422#comment-15069422 ] Markus Jelsma commented on NUTCH-2189: -- Hello Sebastian - i do not think so actually

[jira] [Updated] (NUTCH-2189) Domain filter must deactivate if no rules are present

2015-12-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2189: - Patch Info: Patch Available > Domain filter must deactivate if no rules are pres

[jira] [Created] (NUTCH-2189) Domain filter must deactivate if no rules are present

2015-12-21 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2189: Summary: Domain filter must deactivate if no rules are present Key: NUTCH-2189 URL: https://issues.apache.org/jira/browse/NUTCH-2189 Project: Nutch Issue

[jira] [Updated] (NUTCH-2189) Domain filter must deactivate if no rules are present

2015-12-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2189: - Attachment: NUTCH-2189.patch Patch for trunk. Test passes. If, for any reason, there are zero

[jira] [Updated] (NUTCH-2065) Domain URL filter to support protocols

2015-12-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2065: - Attachment: NUTCH-2065.patch Updated patch to contain NUTCH-2189. Tests pass. > Domain

[jira] [Commented] (NUTCH-2188) While crawling with solr url (kerberos enabled) Error: org.apache.solr.common.SolrException: Unauthorized

2015-12-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15063975#comment-15063975 ] Markus Jelsma commented on NUTCH-2188: -- Ah yes. You would need to patch SolrUtils.java in the indexer

[jira] [Commented] (NUTCH-2188) While crawling with solr url (kerberos enabled) Error: org.apache.solr.common.SolrException: Unauthorized

2015-12-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15061883#comment-15061883 ] Markus Jelsma commented on NUTCH-2188: -- Solr has built-in security since just a few versions

[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060036#comment-15060036 ] Markus Jelsma commented on NUTCH-2184: -- Hello Lewis - you can use the indexer-dummy in unit tests

[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059750#comment-15059750 ] Markus Jelsma commented on NUTCH-2184: -- Hello Lewis - keep in mind the possible configurations

[jira] [Comment Edited] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059750#comment-15059750 ] Markus Jelsma edited comment on NUTCH-2184 at 12/16/15 9:43 AM: Hello

[jira] [Closed] (NUTCH-1995) Add support for wildcard to http.robot.rules.whitelist

2015-12-10 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-1995. Resolution: Fixed Closing again. It seems there was an older Nutch jar lying around. The plugin

[jira] [Comment Edited] (NUTCH-1995) Add support for wildcard to http.robot.rules.whitelist

2015-12-10 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051126#comment-15051126 ] Markus Jelsma edited comment on NUTCH-1995 at 12/10/15 3:36 PM: Guys, we

[jira] [Reopened] (NUTCH-1995) Add support for wildcard to http.robot.rules.whitelist

2015-12-10 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reopened NUTCH-1995: -- Guys, we upgraded to 1.11 but got these curious exceptions when running the crawler on Hadoop

[jira] [Updated] (NUTCH-1449) Optionally delete documents skipped by IndexingFilters

2015-12-10 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1449: - Attachment: NUTCH-1449.patch Previous patch was wrong. > Optionally delete documents skip

RE: [RELEASE] Apache Nutch 1.11

2015-12-08 Thread Markus Jelsma
Nice! -Original message- From: lewis john mcgibbney Sent: Tuesday 8th December 2015 2:34 To: annou...@apache.org; u...@nutch.apache.org; dev@nutch.apache.org Subject: [RELEASE] Apache Nutch 1.11 Hello Folks, 07 December 2015 - Nutch 1.11 Release The Apache Nutch

[jira] [Updated] (NUTCH-1449) Optionally delete documents skipped by IndexingFilters

2015-12-08 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1449: - Attachment: NUTCH-1449.patch Patch for trunk again, 1.12 > Optionally delete documents skip

[jira] [Resolved] (NUTCH-2176) Clean up of log4j.properties

2015-12-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2176. -- Resolution: Fixed Committed to trunk in rev. 1717622. > Clean up of log4j.propert

[jira] [Updated] (NUTCH-2176) Clean up of log4j.properties

2015-12-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2176: - Summary: Clean up of log4j.properties (was: log4j.properties is a mess) > Clean

[jira] [Commented] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-12-01 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033557#comment-15033557 ] Markus Jelsma commented on NUTCH-2177: -- +1 > Generator produces only one partition e

[jira] [Created] (NUTCH-2178) DeduplicationJob to optionall group on host or domain

2015-11-27 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2178: Summary: DeduplicationJob to optionall group on host or domain Key: NUTCH-2178 URL: https://issues.apache.org/jira/browse/NUTCH-2178 Project: Nutch Issue

[jira] [Updated] (NUTCH-2178) DeduplicationJob to optionall group on host or domain

2015-11-27 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2178: - Attachment: NUTCH-2178.patch Patch for trunk > DeduplicationJob to optionall group on h

[jira] [Created] (NUTCH-2176) log4j.properties is a mess

2015-11-26 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2176: Summary: log4j.properties is a mess Key: NUTCH-2176 URL: https://issues.apache.org/jira/browse/NUTCH-2176 Project: Nutch Issue Type: Bug

[jira] [Updated] (NUTCH-2176) log4j.properties is a mess

2015-11-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2176: - Attachment: NUTCH-2176.patch Patch for trunk resolving above mentioned points. Anything else

[jira] [Updated] (NUTCH-2176) log4j.properties is a mess

2015-11-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2176: - Affects Version/s: 1.10 Priority: Trivial (was: Major) Fix Version/s: 1.11

[jira] [Commented] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-11-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15029015#comment-15029015 ] Markus Jelsma commented on NUTCH-2177: -- There seems to be no value for mapred.job.tracker on our own

[jira] [Commented] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-11-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15029017#comment-15029017 ] Markus Jelsma commented on NUTCH-2177: -- +1 for issue being blocker > Generator produces only

[jira] [Commented] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-11-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15029054#comment-15029054 ] Markus Jelsma commented on NUTCH-2177: -- On standard Apache Hadoop YARN 2.7.1 running in high

[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-11-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15013513#comment-15013513 ] Markus Jelsma commented on NUTCH-2069: -- Hi - looks good. One suggestion though. The patch mixes up

[jira] [Comment Edited] (NUTCH-2069) Ignore external links based on domain

2015-11-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15013616#comment-15013616 ] Markus Jelsma edited comment on NUTCH-2069 at 11/19/15 2:35 PM: Ah, i see

[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-11-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15013616#comment-15013616 ] Markus Jelsma commented on NUTCH-2069: -- Ah, i see it now indeed. +1 for this patch > Ignore exter

[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-11-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15012193#comment-15012193 ] Markus Jelsma commented on NUTCH-2069: -- Hi J - i agree with the mode! Have it defaulted so it never

[jira] [Commented] (NUTCH-2120) Remove MapWritable from trunk codebase

2015-11-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002072#comment-15002072 ] Markus Jelsma commented on NUTCH-2120: -- I'm fine with removing it, we're using Hadoop's MapWritable

[jira] [Reopened] (NUTCH-2058) Indexer plugin that allows RegEx replacements on the NutchDocument field values

2015-11-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reopened NUTCH-2058: -- Reopening due to failing unit tests

[jira] [Commented] (NUTCH-2064) URLNormalizer basic to encode reserved chars and decode non-reserved chars

2015-11-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991421#comment-14991421 ] Markus Jelsma commented on NUTCH-2064: -- It looks good to me, there are no immediate issues that come

[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-11-01 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984806#comment-14984806 ] Markus Jelsma commented on NUTCH-2155: -- By `remove current` and `not require current` you guys mean

[jira] [Commented] (NUTCH-2147) LanguagePreferenceScoringFilter for Nutch

2015-10-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973930#comment-14973930 ] Markus Jelsma commented on NUTCH-2147: -- Hello, we did something quite similar but used Jexl

[jira] [Commented] (NUTCH-2147) MetadataScoringFilter for Nutch

2015-10-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14974675#comment-14974675 ] Markus Jelsma commented on NUTCH-2147: -- boolean CrawlDatum.evaluate(Expression expr) is what you need

[jira] [Comment Edited] (NUTCH-2147) MetadataScoringFilter for Nutch

2015-10-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14974675#comment-14974675 ] Markus Jelsma edited comment on NUTCH-2147 at 10/26/15 6:00 PM: Hello

RE: [DISCUSS] Release 1.11 RC #1 (70 issues fixed)

2015-10-22 Thread Markus Jelsma
University of Southern California, Los Angeles, CA 90089 USA > ++++++ > > > > > > -Original Message- > From: Markus Jelsma <markus.jel...@openindex.io> > Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org> > Date: Monday, October 19,

RE: [DISCUSS] Release 1.11 RC #1 (70 issues fixed)

2015-10-19 Thread Markus Jelsma
Hi - I think NUTCH-2064 is too important to miss another release. Everyone using Nutch needs it, especially if you are using HTTPS since httpclient cannot deal with unescaped URLs. M. -Original message- > From:Mattmann, Chris A (3980) > Sent: Sunday

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-10-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Attachment: NUTCH-1932.patch Patch! Records with an orphan time greater than now > lastInlinkT

[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2015-10-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963082#comment-14963082 ] Markus Jelsma commented on NUTCH-2144: -- Hi - i like the purpose of this plugin. The patch, however

[jira] [Commented] (NUTCH-2145) parse/index checker fail to fetch valid percent-encoded URLs

2015-10-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963283#comment-14963283 ] Markus Jelsma commented on NUTCH-2145: -- +1 for passing it through the normalizer. > parse/in

[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2015-10-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964110#comment-14964110 ] Markus Jelsma commented on NUTCH-2144: -- Yes, this is much more readable indeed

[jira] [Updated] (NUTCH-2064) URLNormalizer basic to encode reserved chars and decode non-reserved chars

2015-10-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2064: - Summary: URLNormalizer basic to encode reserved chars and decode non-reserved chars

RE: Webcast : Apache Nutch on EMR

2015-09-23 Thread Markus Jelsma
Very cool! This is probably going to be useful. -Original message- From: Julien Nioche Sent: Wednesday 23rd September 2015 16:35 To: u...@nutch.apache.org; dev@nutch.apache.org Subject: Webcast : Apache Nutch on EMR Hi again, I have uploaded at webcast

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Attachment: NUTCH-1932.patch Updated patch. CrawlDatum now supports Jexl expressions on Long

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Attachment: NUTCH-1932.patch Fixed bad long to int casting. > Automatically remove orpha

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Attachment: NUTCH-1932.patch Wrong default in code was used for markOrphanAfter. Config is ok

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Attachment: NUTCH-1932.patch Uh, using long over int for time keeping makes no sense. Relies

RE: [ANNOUNCE] New Nutch committer and PMC - Sujen Shah

2015-09-16 Thread Markus Jelsma
Welcome!! -Original message- From: Sujen Shah Sent: Wednesday 16th September 2015 0:58 To: dev@nutch.apache.org Cc: u...@nutch.apache.org Subject: Re: [ANNOUNCE] New Nutch committer and PMC - Sujen Shah Hi Everyone, I would like to thank the members of the Apache

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Attachment: NUTCH-1932.patch Probably the final patch. It now includes: * moving reducer code

[jira] [Commented] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14747316#comment-14747316 ] Markus Jelsma commented on NUTCH-2102: -- Hello Julien! I believe this warc format is the updated arc

[jira] [Resolved] (NUTCH-2093) Indexing filters have no signature in CrawlDatum if crawled via FreeGenerator

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2093. -- Resolution: Fixed Assignee: Markus Jelsma Committed to trunk in revision 1703111

[jira] [Commented] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744959#comment-14744959 ] Markus Jelsma commented on NUTCH-2064: -- I think having it in CC makes sense indeed. I shall commit

[jira] [Commented] (NUTCH-2097) Proposal for Nutch 3.x

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744953#comment-14744953 ] Markus Jelsma commented on NUTCH-2097: -- Interesting! What does 'Complete Ant + Ivy build system

[jira] [Comment Edited] (NUTCH-2097) Proposal for Nutch 3.x

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744953#comment-14744953 ] Markus Jelsma edited comment on NUTCH-2097 at 9/15/15 6:50 AM: --- Interesting

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Attachment: NUTCH-1932.patch Eeh, patch with the scoring filter itself. Apparently it is possible

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Attachment: NUTCH-1932.patch New and much simpler patch. This relies on a scoring filter to mark

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Attachment: NUTCH-1932.patch > Automatically remove orphaned pa

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Attachment: NUTCH-1932.patch First proper working patch. Tests pass > Automatically rem

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Description: Orphan scoring filter that determines whether a page has become orphaned, e.g

[jira] [Commented] (NUTCH-2097) Proposal for Nutch 3.x

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745322#comment-14745322 ] Markus Jelsma commented on NUTCH-2097: -- Yes, having them as separate mapper and reducer class files

[jira] [Commented] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746034#comment-14746034 ] Markus Jelsma commented on NUTCH-1932: -- Hello Sebastian. I am not sure about that being on the list

[jira] [Created] (NUTCH-2093) Indexing filters have no signature in CrawlDatum if crawled via FreeGenerator

2015-09-11 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2093: Summary: Indexing filters have no signature in CrawlDatum if crawled via FreeGenerator Key: NUTCH-2093 URL: https://issues.apache.org/jira/browse/NUTCH-2093 Project

[jira] [Updated] (NUTCH-2093) Indexing filters have no signature in CrawlDatum if crawled via FreeGenerator

2015-09-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2093: - Attachment: NUTCH-2093.patch Patch for trunk. > Indexing filters have no signature in CrawlDa

RE: [ANNOUNCE] New Nutch committer and PMC - Asitang Mishra

2015-09-10 Thread Markus Jelsma
Welcome! -Original message- > From:Sebastian Nagel > Sent: Thursday 10th September 2015 0:01 > To: dev@nutch.apache.org > Cc: u...@nutch.apache.org > Subject: [ANNOUNCE] New Nutch committer and PMC - Asitang Mishra > > Dear all, > > on behalf of the Nutch

[jira] [Comment Edited] (NUTCH-1084) ReadDB url throws exception

2015-08-27 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14716809#comment-14716809 ] Markus Jelsma edited comment on NUTCH-1084 at 8/27/15 3:03 PM

[jira] [Commented] (NUTCH-1084) ReadDB url throws exception

2015-08-27 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14716809#comment-14716809 ] Markus Jelsma commented on NUTCH-1084: -- I am getting sad, setting

[jira] [Resolved] (NUTCH-2085) Upgrade Guava

2015-08-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2085. -- Resolution: Fixed Assignee: Markus Jelsma Committed to trunk in rev 1697860. Upgrade

[jira] [Resolved] (NUTCH-2084) Track changes in input dirs for SegmentMerger

2015-08-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2084. -- Resolution: Fixed Assignee: Markus Jelsma Committed to trunk in rev 1697858. Track

RE: [DISCUSS] Release Nutch trunk 1.11

2015-08-26 Thread Markus Jelsma
Yes Julien, please commit. I do think https://issues.apache.org/jira/browse/NUTCH-2064 should also be included. But i have my hands full atm. -Original message- From: Julien Nioche lists.digitalpeb...@gmail.com Sent: Wednesday 26th August 2015 13:51 To: dev@nutch.apache.org Subject: Re:

[jira] [Updated] (NUTCH-2084) Track changes in input dirs for SegmentMerger

2015-08-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2084: - Description: When merging 1000's of segments, and one is corrupt, broken, whatever, the merge

[jira] [Updated] (NUTCH-2084) Track changes in input dirs for SegmentMerger

2015-08-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2084: - Attachment: NUTCH-2084.patch Patch for trunk. Track changes in input dirs for SegmentMerger

[jira] [Commented] (NUTCH-2084) Track changes in input dirs for SegmentMerger

2015-08-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14710828#comment-14710828 ] Markus Jelsma commented on NUTCH-2084: -- Well, this immediately helped me track down

[jira] [Created] (NUTCH-2084) Track changes in input dirs for SegmentMerger

2015-08-25 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2084: Summary: Track changes in input dirs for SegmentMerger Key: NUTCH-2084 URL: https://issues.apache.org/jira/browse/NUTCH-2084 Project: Nutch Issue Type: Bug

[jira] [Created] (NUTCH-2085) Upgrade Guava

2015-08-25 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2085: Summary: Upgrade Guava Key: NUTCH-2085 URL: https://issues.apache.org/jira/browse/NUTCH-2085 Project: Nutch Issue Type: Task Affects Versions: 1.10

[jira] [Updated] (NUTCH-2085) Upgrade Guava

2015-08-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2085: - Attachment: NUTCH-2085.patch Patch for trunk. Tests pass except for ParserFactory, which fails

[jira] [Updated] (NUTCH-2085) Upgrade Guava

2015-08-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2085: - Patch Info: Patch Available Upgrade Guava - Key: NUTCH-2085

[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-07-29 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646504#comment-14646504 ] Markus Jelsma commented on NUTCH-2069: -- Fine with the feature but there's a lot

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-07-27 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Attachment: NUTCH-1932.patch Updated patch for trunk. This still relies on a LinkDB and a CrawlDB

[jira] [Updated] (NUTCH-2068) Allow subcollection overrides via metadata

2015-07-27 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2068: - Attachment: NUTCH-2068.patch patch for trunk Allow subcollection overrides via metadata

[jira] [Updated] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters

2015-07-27 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2064: - Attachment: NUTCH-2064.patch Quick and dirty patch where [ and ] are also encoded. I just

[jira] [Commented] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters

2015-07-27 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642500#comment-14642500 ] Markus Jelsma commented on NUTCH-2064: -- Also, spaces are now also escaped

[jira] [Updated] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters

2015-07-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2064: - Attachment: NUTCH-1098.patch Excellent! I have added both characters as a new test and it passes

[jira] [Created] (NUTCH-2065) Domain URL filter to support protocols

2015-07-21 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2065: Summary: Domain URL filter to support protocols Key: NUTCH-2065 URL: https://issues.apache.org/jira/browse/NUTCH-2065 Project: Nutch Issue Type: Improvement
