[jira] [Updated] (NUTCH-2229) Allow Jexl expressions on CrawlDatum's fixed attributes

2016-02-24 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2229: - Patch Info: Patch Available Description: CrawlDatum allows Jexl expressions on its metadata

[jira] [Created] (NUTCH-2229) Allow Jexl expressions on CrawlDatum's fixed attributes

2016-02-23 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2229: Summary: Allow Jexl expressions on CrawlDatum's fixed attributes Key: NUTCH-2229 URL: https://issues.apache.org/jira/browse/NUTCH-2229 Project: Nutch Issue

[jira] [Resolved] (NUTCH-2227) RegexParseFilter

2016-02-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2227. -- Resolution: Fixed Committed to trunk in revision 1731849. > RegexParseFil

[jira] [Updated] (NUTCH-2227) RegexParseFilter

2016-02-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2227: - Attachment: NUTCH-2227.patch Updated patch. conf/regex-parsefilter.txt was missing in the patch

[jira] [Comment Edited] (NUTCH-2227) RegexParseFilter

2016-02-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158808#comment-15158808 ] Markus Jelsma edited comment on NUTCH-2227 at 2/23/16 12:45 PM: Updated

[jira] [Updated] (NUTCH-2227) RegexParseFilter

2016-02-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2227: - Attachment: NUTCH-2227.patch Updated patch. It now includes package-info.java. Will commit

[jira] [Updated] (NUTCH-2216) db.ignore.*.links to optionally follow internal redirects

2016-02-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2216: - Attachment: NUTCH-2216.patch Updated patch for trunk. And included second and third comments

[jira] [Resolved] (NUTCH-2221) Introduce db.ignore.internal.links to FetcherThread

2016-02-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2221. -- Resolution: Fixed Assignee: Markus Jelsma > Introduce db.ignore.internal.li

[jira] [Commented] (NUTCH-2221) Introduce db.ignore.internal.links to FetcherThread

2016-02-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158691#comment-15158691 ] Markus Jelsma commented on NUTCH-2221: -- Committed to trunk in revision 1731836. > Introd

[jira] [Comment Edited] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158684#comment-15158684 ] Markus Jelsma edited comment on NUTCH-2144 at 2/23/16 10:39 AM

[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158684#comment-15158684 ] Markus Jelsma commented on NUTCH-2144: -- ParseOutputFormat.filterNormalize() signature has changed

[jira] [Updated] (NUTCH-2221) Introduce db.ignore.internal.links to FetcherThread

2016-02-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2221: - Attachment: NUTCH-2221.patch Updated patch for current trunk revision. Will commit shortly

[jira] [Updated] (NUTCH-2220) Rename db.* options used only by the linkdb to linkdb.*

2016-02-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2220: - Description: We need an option db.ignore.internal.links that operates in FetcherThread, just

[jira] [Updated] (NUTCH-2220) Rename db.* options used only by the linkdb to linkdb.*

2016-02-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2220: - Description: We need an option db.ignore.internal.links that operates in FetcherThread, just

[jira] [Comment Edited] (NUTCH-2220) Rename db.* options used only by the linkdb to linkdb.*

2016-02-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158651#comment-15158651 ] Markus Jelsma edited comment on NUTCH-2220 at 2/23/16 10:04 AM: Yes, i

[jira] [Commented] (NUTCH-2220) Rename db.* options used only by the linkdb to linkdb.*

2016-02-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158651#comment-15158651 ] Markus Jelsma commented on NUTCH-2220: -- Yes, i would opt for an incompatibility note at the top

[jira] [Updated] (NUTCH-2228) Plugin index-replace unit test broken on Java 8

2016-02-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2228: - Summary: Plugin index-replace unit test broken on Java 8 (was: index-replace unit test fails

[jira] [Commented] (NUTCH-2228) index-replace unit test fails

2016-02-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158633#comment-15158633 ] Markus Jelsma commented on NUTCH-2228: -- Ah i see! Your patch addresses the problem nicely. I'll

[jira] [Assigned] (NUTCH-2228) index-replace unit test fails

2016-02-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-2228: Assignee: Markus Jelsma > index-replace unit test fa

[jira] [Created] (NUTCH-2228) index-replace unit test fails

2016-02-22 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2228: Summary: index-replace unit test fails Key: NUTCH-2228 URL: https://issues.apache.org/jira/browse/NUTCH-2228 Project: Nutch Issue Type: Bug

[jira] [Work stopped] (NUTCH-2227) RegexParseFilter

2016-02-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2227 stopped by Markus Jelsma. > RegexParseFilter > > > Key

[jira] [Updated] (NUTCH-2227) RegexParseFilter

2016-02-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2227: - Attachment: NUTCH-2227.patch Updated patch, added negative test. Which works. Will commit

[jira] [Updated] (NUTCH-2227) RegexParseFilter

2016-02-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2227: - Attachment: NUTCH-2227.patch Updated patch, build.xml was missing > RegexParseFil

[jira] [Updated] (NUTCH-2227) RegexParseFilter

2016-02-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2227: - Attachment: NUTCH-2227.patch Patch for trunk! Tests pass. > RegexParseFil

[jira] [Work started] (NUTCH-2227) RegexParseFilter

2016-02-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2227 started by Markus Jelsma. > RegexParseFilter > > > Key

[jira] [Updated] (NUTCH-2227) RegexParseFilter

2016-02-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2227: - Description: A parse filter that takes a regex and a field name. If regex matches via

[jira] [Created] (NUTCH-2227) RegexParseFilter

2016-02-22 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2227: Summary: RegexParseFilter Key: NUTCH-2227 URL: https://issues.apache.org/jira/browse/NUTCH-2227 Project: Nutch Issue Type: New Feature Components

[jira] [Updated] (NUTCH-2219) Criteria order to be configurable in DeduplicationJob

2016-02-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2219: - Fix Version/s: 1.12 > Criteria order to be configurable in Deduplication

[jira] [Updated] (NUTCH-2219) Criteria order to be configurable in DeduplicationJob

2016-02-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2219: - Affects Version/s: 1.11 > Criteria order to be configurable in Deduplication

[jira] [Resolved] (NUTCH-2219) Criteria order to be configurable in DeduplicationJob

2016-02-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2219. -- Resolution: Fixed Committed to trunk in revision 1731651. Thanks Ron van der Vegt > Crite

[jira] [Commented] (NUTCH-2226) SOLR mismatch in deploy mode

2016-02-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157027#comment-15157027 ] Markus Jelsma commented on NUTCH-2226: -- Hello - how is this related? Are you using trunk? We run

[jira] [Commented] (NUTCH-2220) Rename db.* options used only by the linkdb to linkdb.*

2016-02-22 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15156711#comment-15156711 ] Markus Jelsma commented on NUTCH-2220: -- Any comments to this change, e.g. separate db and linkdb

RE: [RESULT] [VOTE] Moving to Git

2016-02-22 Thread Markus Jelsma
Can someone please put up a small howto somewhere? I need to know how to: * check out trunk * check out a specific tag * do an svn up * create a patch, e.g. svn diff * perform a commit Thanks, Markus -Original message- > From:Mattmann, Chris A (3980) >

[jira] [Updated] (NUTCH-2219) Criteria order to be configurable in DeduplicationJob

2016-02-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2219: - Description: Current implementation: "This command takes a path to a crawldb as para

[jira] [Updated] (NUTCH-2219) Criteria order to be configurable in DeduplicationJob

2016-02-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2219: - Summary: Criteria order to be configurable in DeduplicationJob (was: Dedup script, allow users

[jira] [Updated] (NUTCH-2219) Dedup script, allow users to change the order in which main documents are selected.

2016-02-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2219: - Attachment: NUTCH-2219.patch Thanks, looks fine! Slightly updated patch: * changed usage output

[jira] [Assigned] (NUTCH-2219) Dedup script, allow users to change the order in which main documents are selected.

2016-02-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-2219: Assignee: Markus Jelsma > Dedup script, allow users to change the order in which m

[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-02-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152221#comment-15152221 ] Markus Jelsma commented on NUTCH-2191: -- 1. although that could work, it does not truly resolve

[jira] [Comment Edited] (NUTCH-2191) Add protocol-htmlunit

2016-02-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152184#comment-15152184 ] Markus Jelsma edited comment on NUTCH-2191 at 2/18/16 11:34 AM: 1. ah yes

[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-02-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152184#comment-15152184 ] Markus Jelsma commented on NUTCH-2191: -- 1. ah yes, we still need to fix this crazy plugin dependency

[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-02-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152141#comment-15152141 ] Markus Jelsma commented on NUTCH-2191: -- Hello Kshijtij - well no, certainly not at this time

[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-02-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152140#comment-15152140 ] Markus Jelsma commented on NUTCH-2191: -- Hi - it works indeed. But new problems appear, as usual! 1

[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151154#comment-15151154 ] Markus Jelsma commented on NUTCH-2191: -- Hi Karanjeet - looks like the only changes you made

[jira] [Resolved] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2223. -- Resolution: Fixed Committed to trunk in revision 1730808. > Upgrade xercesImpl to 2.1

[jira] [Commented] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150264#comment-15150264 ] Markus Jelsma commented on NUTCH-2223: -- Thanks Tien Nguyen Manh! > Upgrade xercesImpl to 2.1

[jira] [Commented] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150248#comment-15150248 ] Markus Jelsma commented on NUTCH-2223: -- Incredible, i tried the tika-breaker.html file in the linked

[jira] [Assigned] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-2223: Assignee: Markus Jelsma > Upgrade xercesImpl to 2.11.0 to fix hang on issue in t

[jira] [Updated] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2223: - Priority: Major (was: Minor) > Upgrade xercesImpl to 2.11.0 to fix hang on issue in t

[jira] [Updated] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2223: - Description: Stacktrace for the hang seems to be: {code

[jira] [Updated] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2223: - Fix Version/s: 1.12 > Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimet

[jira] [Updated] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2223: - Description: {code}Stacktrace for the hang seems

[jira] [Updated] (NUTCH-2224) Average bytes/second calculated incorrectly in fetcher

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2224: - Component/s: fetcher > Average bytes/second calculated incorrectly in fetc

[jira] [Updated] (NUTCH-2224) Average bytes/second calculated incorrectly in fetcher

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2224: - Affects Version/s: 1.11 > Average bytes/second calculated incorrectly in fetc

[jira] [Updated] (NUTCH-2224) Average bytes/second calculated incorrectly in fetcher

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2224: - Fix Version/s: 1.12 > Average bytes/second calculated incorrectly in fetc

[jira] [Resolved] (NUTCH-2224) Average bytes/second calculated incorrectly in fetcher

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2224. -- Resolution: Fixed Committed to trunk in revision 1730803. Thanks Tien Nguyen Manh! > Aver

[jira] [Updated] (NUTCH-2224) Average bytes/second calculated incorrectly in fetcher

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2224: - Summary: Average bytes/second calculated incorrectly in fetcher (was: Wrong metric compute

[jira] [Assigned] (NUTCH-2224) Wrong metric compute in Fetcher status report

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-2224: Assignee: Markus Jelsma > Wrong metric compute in Fetcher status rep

[jira] [Resolved] (NUTCH-2225) Parsed time calculated incorrectly

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2225. -- Resolution: Fixed Committed to trunk in revision 1730802. Thanks Tien Nguyen Manh! > Par

[jira] [Updated] (NUTCH-2225) Parsed time calculated incorrectly

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2225: - Summary: Parsed time calculated incorrectly (was: Parsed time not include time to parse

[jira] [Assigned] (NUTCH-2225) Parsed time not include time to parse

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-2225: Assignee: Markus Jelsma > Parsed time not include time to pa

[jira] [Updated] (NUTCH-2225) Parsed time not include time to parse

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2225: - Affects Version/s: 1.11 > Parsed time not include time to pa

[jira] [Updated] (NUTCH-2225) Parsed time not include time to parse

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2225: - Fix Version/s: 1.12 > Parsed time not include time to pa

[jira] [Resolved] (NUTCH-961) Expose Tika's boilerpipe support

2016-02-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-961. - Resolution: Fixed Committed to trunk in revision 1730694. Thanks everyone for contributions

[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

2016-02-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-961: Attachment: NUTCH-961.patch Updated patch. ExtractorRepository was missing. > Expose Tik

[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

2016-02-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-961: Fix Version/s: 1.12 > Expose Tika's boilerpipe supp

[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

2016-02-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-961: Affects Version/s: 1.11 > Expose Tika's boilerpipe supp

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-02-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148642#comment-15148642 ] Markus Jelsma commented on NUTCH-961: - Tests pass as expected and Boilerpipe as well. Will commit

[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

2016-02-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-961: Description: Tika 0.8 comes with the Boilerpipe content handler which can be used to extract

[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

2016-02-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-961: Description: Tika 0.8 comes with the Boilerpipe content handler which can be used to extract

[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

2016-02-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-961: Attachment: NUTCH-961.patch Patch for trunk. > Expose Tika's boilerpipe supp

[jira] [Resolved] (NUTCH-1233) Rely on Tika for outlink extraction

2016-02-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-1233. -- Resolution: Fixed Committed to trunk in revision 1730687. > Rely on Tika for outl

[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

2016-02-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1233: - Affects Version/s: 1.11 > Rely on Tika for outlink extract

[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

2016-02-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1233: - Fix Version/s: 1.12 > Rely on Tika for outlink extract

[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

2016-02-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1233: - Component/s: parser > Rely on Tika for outlink extract

[jira] [Commented] (NUTCH-1233) Rely on Tika for outlink extraction

2016-02-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148590#comment-15148590 ] Markus Jelsma commented on NUTCH-1233: -- Awesome! Everything works as expected since the Tika 1.12

[jira] [Resolved] (NUTCH-2210) Upgrade to Tika 1.12

2016-02-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2210. -- Resolution: Fixed Committed to trunk in revision 1730686. > Upgrade to Tika 1

[jira] [Commented] (NUTCH-2210) Upgrade to Tika 1.12

2016-02-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148572#comment-15148572 ] Markus Jelsma commented on NUTCH-2210: -- Test passes, will commit shortly. > Upgrade to Tika 1

[jira] [Updated] (NUTCH-2210) Upgrade to Tika 1.12

2016-02-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2210: - Attachment: NUTCH-2210.patch Patch for trunk. > Upgrade to Tika 1

[jira] [Commented] (NUTCH-2197) Add solr5 solrcloud indexer support

2016-02-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148489#comment-15148489 ] Markus Jelsma commented on NUTCH-2197: -- Hello Arun - no, this is not applied to 2.3.1. The plugins

[jira] [Commented] (NUTCH-2210) Upgrade to Tika 1.12

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147735#comment-15147735 ] Markus Jelsma commented on NUTCH-2210: -- Apache Tika 1.12 is available. Will upgrade as soon

[jira] [Updated] (NUTCH-2221) Introduce db.ignore.internal.links to FetcherThread

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2221: - Attachment: NUTCH-2216-NUTCH-2220-NUTCH-2221.patch Patch for trunk. This includes all three

[jira] [Updated] (NUTCH-2216) db.ignore.*.links to optionally follow internal redirects

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2216: - Attachment: NUTCH-2216.patch Patch for trunk, introducing db.ignore.treat.redirects.as.links

[jira] [Updated] (NUTCH-2216) db.ignore.*.links to optionally follow internal redirects

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2216: - Summary: db.ignore.*.links to optionally follow internal redirects (was: ignore.internal.links

[jira] [Updated] (NUTCH-2221) Introduce db.ignore.internal.links to FetcherThread

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2221: - Attachment: NUTCH-2221.patch Patch for trunk. This includes the modified config of NUTCH-2220

[jira] [Updated] (NUTCH-2220) Rename db.* options used only by the linkdb to linkdb.*

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2220: - Patch Info: Patch Available > Rename db.* options used only by the linkdb to lin

[jira] [Updated] (NUTCH-2220) Rename db.* options used only by the linkdb to linkdb.*

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2220: - Attachment: NUTCH-2220.patch Patch for trunk > Rename db.* options used only by the lin

[jira] [Updated] (NUTCH-2221) Introduce db.ignore.internal.links to FetcherThread

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2221: - Description: FetcherThread has support for db.ignore.external.links. In config you can find

[jira] [Updated] (NUTCH-2221) Introduce db.ignore.internal.links to FetcherThread

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2221: - Summary: Introduce db.ignore.internal.links to FetcherThread (was: Introduce

[jira] [Created] (NUTCH-2221) Introduce db.ignore.external.links to FetcherThread

2016-02-15 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2221: Summary: Introduce db.ignore.external.links to FetcherThread Key: NUTCH-2221 URL: https://issues.apache.org/jira/browse/NUTCH-2221 Project: Nutch Issue Type

[jira] [Created] (NUTCH-2220) Rename db.* options used only by the linkdb to linkdb.*

2016-02-15 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2220: Summary: Rename db.* options used only by the linkdb to linkdb.* Key: NUTCH-2220 URL: https://issues.apache.org/jira/browse/NUTCH-2220 Project: Nutch Issue

[jira] [Resolved] (NUTCH-2189) Domain filter must deactivate if no rules are present

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2189. -- Resolution: Fixed > Domain filter must deactivate if no rules are pres

[jira] [Updated] (NUTCH-2189) Domain filter must deactivate if no rules are present

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2189: - Fix Version/s: 1.12 > Domain filter must deactivate if no rules are pres

[jira] [Updated] (NUTCH-2189) Domain filter must deactivate if no rules are present

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2189: - Affects Version/s: 1.11 > Domain filter must deactivate if no rules are pres

[jira] [Reopened] (NUTCH-2189) Domain filter must deactivate if no rules are present

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reopened NUTCH-2189: -- Fix version missing > Domain filter must deactivate if no rules are pres

[jira] [Closed] (NUTCH-2189) Domain filter must deactivate if no rules are present

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2189. > Domain filter must deactivate if no rules are pres

[jira] [Created] (NUTCH-2216) ignore.internal.links to optionally follow internal redirects

2016-02-12 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2216: Summary: ignore.internal.links to optionally follow internal redirects Key: NUTCH-2216 URL: https://issues.apache.org/jira/browse/NUTCH-2216 Project: Nutch

[jira] [Commented] (NUTCH-2216) ignore.internal.links to optionally follow internal redirects

2016-02-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15144463#comment-15144463 ] Markus Jelsma commented on NUTCH-2216: -- Apparently db.ignore.internal.links is not implemented

[jira] [Commented] (NUTCH-2216) ignore.internal.links to optionally follow internal redirects

2016-02-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15144497#comment-15144497 ] Markus Jelsma commented on NUTCH-2216: -- Additionally, it probably should not be implemented because

[jira] [Commented] (NUTCH-2216) ignore.internal.links to optionally follow internal redirects

2016-02-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15144518#comment-15144518 ] Markus Jelsma commented on NUTCH-2216: -- An option is to change the default

[jira] [Updated] (NUTCH-2215) Generator to restrict crawl to mime type

2016-02-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2215: - Attachment: NUTCH-2215.patch Tiny error in nutch-default description. > Generator to restr
