[jira] [Updated] (NUTCH-2750) improve CrawlDbReader & LinkDbReader reader handling
[ https://issues.apache.org/jira/browse/NUTCH-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jurian Broertjes updated NUTCH-2750: Description: The current implementation in the CrawlDbReader re-opens readers for every URL. This is not very efficient. I've implemented a modification time check that only re-opens readers on updated crawlDB. PR: https://github.com/apache/nutch/pull/483 was: The current implementation in the CrawlDbReader re-opens readers for every URL. This is not very efficient. I've implemented a modification time check that only re-opens readers on updated crawlDB. > improve CrawlDbReader & LinkDbReader reader handling > > > Key: NUTCH-2750 > URL: https://issues.apache.org/jira/browse/NUTCH-2750 > Project: Nutch > Issue Type: Improvement > Components: crawldb, linkdb >Affects Versions: 1.16 >Reporter: Jurian Broertjes >Priority: Minor > > The current implementation in the CrawlDbReader re-opens readers for every > URL. This is not very efficient. I've implemented a modification time check > that only re-opens readers on updated crawlDB. > PR: https://github.com/apache/nutch/pull/483 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (NUTCH-2750) improve CrawlDbReader & LinkDbReader reader handling
Jurian Broertjes created NUTCH-2750: --- Summary: improve CrawlDbReader & LinkDbReader reader handling Key: NUTCH-2750 URL: https://issues.apache.org/jira/browse/NUTCH-2750 Project: Nutch Issue Type: Improvement Components: crawldb, linkdb Affects Versions: 1.16 Reporter: Jurian Broertjes The current implementation in the CrawlDbReader re-opens readers for every URL. This is not very efficient. I've implemented a modification time check that only re-opens readers on updated crawlDB. -- This message was sent by Atlassian Jira (v8.3.4#803005)
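A minimal sketch of the modification-time check described above, not the code from the PR. It assumes the DB lives at a single path and uses `java.nio.file` as a stand-in for the Hadoop file system; `ModTimeGatedReader`, `get()` and `getOpenCount()` are hypothetical names, and the `Object` reader is a placeholder for the real reader objects.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

// Hypothetical holder class; not the Nutch CrawlDbReader API.
class ModTimeGatedReader {
    private final Path dbPath;
    private FileTime openedAt;   // modification time seen when the reader was last opened
    private Object reader;       // placeholder for the real (expensive to open) readers
    private int openCount = 0;   // kept only so the behaviour can be observed

    ModTimeGatedReader(Path dbPath) {
        this.dbPath = dbPath;
    }

    // Returns the cached reader and re-opens it only if the DB changed on disk.
    synchronized Object get() throws IOException {
        FileTime current = Files.getLastModifiedTime(dbPath);
        if (reader == null || !current.equals(openedAt)) {
            reader = open();
            openedAt = current;
        }
        return reader;
    }

    // Placeholder for opening the real readers.
    private Object open() {
        openCount++;
        return new Object();
    }

    synchronized int getOpenCount() {
        return openCount;
    }
}
```

With this shape, repeated lookups against an unchanged DB share one reader, and the cost of re-opening is paid only after an update touches the path.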
[jira] [Updated] (NUTCH-2717) Generator cannot open hostDB
[ https://issues.apache.org/jira/browse/NUTCH-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jurian Broertjes updated NUTCH-2717: Description: During generate, the hostDB cannot be opened anymore, see: {quote}2019-05-16 15:53:50,134 ERROR crawl.Generator - Error reading HostDB because File file:/hostdb/current/part-r-0/data does not exist {quote} PR: https://github.com/apache/nutch/pull/455 was: During generate, the hostDB cannot be opened anymore, see: {quote}2019-05-16 15:53:50,134 ERROR crawl.Generator - Error reading HostDB because File file:/hostdb/current/part-r-0/data does not exist {quote} I will create a PR for the fix > Generator cannot open hostDB > > > Key: NUTCH-2717 > URL: https://issues.apache.org/jira/browse/NUTCH-2717 > Project: Nutch > Issue Type: Bug > Components: generator, hostdb >Affects Versions: 1.15 >Reporter: Jurian Broertjes >Priority: Minor > > During generate, the hostDB cannot be opened anymore, see: > {quote}2019-05-16 15:53:50,134 ERROR crawl.Generator - Error reading HostDB > because File file:/hostdb/current/part-r-0/data does not exist > {quote} > PR: https://github.com/apache/nutch/pull/455 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (NUTCH-2717) Generator cannot open hostDB
Jurian Broertjes created NUTCH-2717: --- Summary: Generator cannot open hostDB Key: NUTCH-2717 URL: https://issues.apache.org/jira/browse/NUTCH-2717 Project: Nutch Issue Type: Bug Components: generator, hostdb Affects Versions: 1.15 Reporter: Jurian Broertjes During generate, the hostDB cannot be opened anymore, see: {quote}2019-05-16 15:53:50,134 ERROR crawl.Generator - Error reading HostDB because File file:/hostdb/current/part-r-0/data does not exist {quote} I will create a PR for the fix -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2525) Metadata indexer cannot handle uppercase parse metadata
[ https://issues.apache.org/jira/browse/NUTCH-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jurian Broertjes updated NUTCH-2525: Attachment: NUTCH-2525-p1.patch > Metadata indexer cannot handle uppercase parse metadata > --- > > Key: NUTCH-2525 > URL: https://issues.apache.org/jira/browse/NUTCH-2525 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.14 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.16 > > Attachments: NUTCH-2525-p1.patch, NUTCH-2525.patch > > > MetadataIndexer lowercases keys for parse metadata, making it impossible to > index metadata containing uppercase. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2525) Metadata indexer cannot handle uppercase parse metadata
[ https://issues.apache.org/jira/browse/NUTCH-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834822#comment-16834822 ] Jurian Broertjes commented on NUTCH-2525: - Updated patch so it applies against master > Metadata indexer cannot handle uppercase parse metadata > --- > > Key: NUTCH-2525 > URL: https://issues.apache.org/jira/browse/NUTCH-2525 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.14 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.16 > > Attachments: NUTCH-2525-p1.patch, NUTCH-2525.patch > > > MetadataIndexer lowercases keys for parse metadata, making it impossible to > index metadata containing uppercase. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2565) MergeDB incorrectly handles unfetched CrawlDatums
[ https://issues.apache.org/jira/browse/NUTCH-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512589#comment-16512589 ] Jurian Broertjes commented on NUTCH-2565: - Updated PR with the proposed solution > MergeDB incorrectly handles unfetched CrawlDatums > - > > Key: NUTCH-2565 > URL: https://issues.apache.org/jira/browse/NUTCH-2565 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.14 >Reporter: Jurian Broertjes >Priority: Minor > > I ran into this issue when merging a crawlDB originating from sitemaps into > our normal crawlDB. CrawlDatums are merged based on output of > AbstractFetchSchedule::calculateLastFetchTime(). When CrawlDatums are > unfetched, this can overwrite fetchTime or other stuff. > I assume this is a bug and have a simple fix for it that checks if CrawlDatum > has status db_unfetched. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2597) NPE in updatehostdb
[ https://issues.apache.org/jira/browse/NUTCH-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511082#comment-16511082 ] Jurian Broertjes commented on NUTCH-2597: - PR: [https://github.com/apache/nutch/pull/349] Fixes cleanup(), also for indexer/CleaningJob.java > NPE in updatehostdb > --- > > Key: NUTCH-2597 > URL: https://issues.apache.org/jira/browse/NUTCH-2597 > Project: Nutch > Issue Type: Bug > Components: hostdb >Affects Versions: 1.15 >Reporter: Jurian Broertjes >Priority: Critical > > I get an NPE on updatehostdb. I start with a clean crawlDB & hostDB. After an > inject, I do an updatehostdb with -checkAll and get the following stacktrace: > {code} > 2018-06-13 10:45:21,958 WARN hostdb.ResolverThread - > java.lang.NullPointerException > at > org.apache.hadoop.io.SequenceFile$Writer.checkAndWriteSync(SequenceFile.java:1359) > at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1400) > at > org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:83) > at > org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:558) > at > org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89) > at > org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105) > at org.apache.nutch.hostdb.ResolverThread.run(ResolverThread.java:82) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} > Is this related to NUTCH-2375? > If further testing is needed, please let me know! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (NUTCH-2597) NPE in updatehostdb
Jurian Broertjes created NUTCH-2597: --- Summary: NPE in updatehostdb Key: NUTCH-2597 URL: https://issues.apache.org/jira/browse/NUTCH-2597 Project: Nutch Issue Type: Bug Components: hostdb Affects Versions: 1.15 Reporter: Jurian Broertjes I get an NPE on updatehostdb. I start with a clean crawlDB & hostDB. After an inject, I do an updatehostdb with -checkAll and get the following stacktrace: 2018-06-13 10:45:21,958 WARN hostdb.ResolverThread - java.lang.NullPointerException at org.apache.hadoop.io.SequenceFile$Writer.checkAndWriteSync(SequenceFile.java:1359) at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1400) at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:83) at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:558) at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89) at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105) at org.apache.nutch.hostdb.ResolverThread.run(ResolverThread.java:82) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Is this related to NUTCH-2375? If further testing is needed, please let me know! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2565) MergeDB incorrectly handles unfetched CrawlDatums
[ https://issues.apache.org/jira/browse/NUTCH-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509801#comment-16509801 ] Jurian Broertjes commented on NUTCH-2565: - Maybe it would be sufficient to only test for STATUS_DB_UNFETCHED in calculateLastFetchTime(datum), but fall back on CrawlDatum.getFetchTime() in the merger and pick the newest according to that. That way we could also just pick the retries value from the newest one and keep it simple. I'll add a PR later for review. > MergeDB incorrectly handles unfetched CrawlDatums > - > > Key: NUTCH-2565 > URL: https://issues.apache.org/jira/browse/NUTCH-2565 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.14 >Reporter: Jurian Broertjes >Priority: Minor > > I ran into this issue when merging a crawlDB originating from sitemaps into > our normal crawlDB. CrawlDatums are merged based on output of > AbstractFetchSchedule::calculateLastFetchTime(). When CrawlDatums are > unfetched, this can overwrite fetchTime or other stuff. > I assume this is a bug and have a simple fix for it that checks if CrawlDatum > has status db_unfetched. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
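The proposal in this comment could look roughly like the sketch below. It is an illustration only, not the PR: `Datum` is a simplified, hypothetical stand-in for CrawlDatum, and the formula in `lastFetchTime()` (fetch time minus fetch interval) is an assumption about how the last fetch time is derived for fetched entries.

```java
// Hypothetical, simplified stand-in for CrawlDatum; field names are illustrative.
class Datum {
    static final int UNFETCHED = 1;
    static final int FETCHED = 2;

    final int status;
    final long fetchTime;        // milliseconds
    final int fetchIntervalSec;  // seconds
    final int retries;

    Datum(int status, long fetchTime, int fetchIntervalSec, int retries) {
        this.status = status;
        this.fetchTime = fetchTime;
        this.fetchIntervalSec = fetchIntervalSec;
        this.retries = retries;
    }
}

class MergeSketch {
    // Unfetched entries have no meaningful last fetch, so report 0 for them.
    // Assumption: for fetched entries, last fetch = fetchTime - fetch interval.
    static long lastFetchTime(Datum d) {
        if (d.status == Datum.UNFETCHED) {
            return 0L;
        }
        return d.fetchTime - d.fetchIntervalSec * 1000L;
    }

    // Prefer the entry with the later last fetch; on a tie (for example two
    // unfetched entries) fall back to the raw fetch time. The winner is kept
    // whole, so its retries value comes along with it.
    static Datum pickNewest(Datum a, Datum b) {
        long la = lastFetchTime(a);
        long lb = lastFetchTime(b);
        if (la != lb) {
            return la > lb ? a : b;
        }
        return a.fetchTime >= b.fetchTime ? a : b;
    }
}
```

Keeping the winning entry whole, retries included, is what makes this simpler than summing the retries of both entries.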
[jira] [Commented] (NUTCH-2012) Merge parsechecker and indexchecker
[ https://issues.apache.org/jira/browse/NUTCH-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509686#comment-16509686 ] Jurian Broertjes commented on NUTCH-2012: - It looks like the process() function still uses System.out.println for output, instead of the output StringBuilder. I can supply a small PR to fix it. > Merge parsechecker and indexchecker > --- > > Key: NUTCH-2012 > URL: https://issues.apache.org/jira/browse/NUTCH-2012 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.10 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 1.15 > > > ParserChecker and IndexingFiltersChecker have evolved from simple tools to > check parsers and parsefilters resp. indexing filters to powerful tools which > emulate the crawling of a single URL/document: > - check robots.txt (NUTCH-2002) > - follow redirects (NUTCH-2004) > Keeping both tools in sync takes extra work (cf. NUTCH-1757/NUTCH-2006, also > NUTCH-2002, NUTCH-2004 are done only for parsechecker). It's time to merge > them > * either into one general debugging tool, keeping parsechecker and > indexchecker as aliases > * centralize common code in one utility class -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2565) MergeDB incorrectly handles unfetched CrawlDatums
[ https://issues.apache.org/jira/browse/NUTCH-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509573#comment-16509573 ] Jurian Broertjes commented on NUTCH-2565: - One solution would be to sum the retries of both CrawlDatums. We could do this only for db_unfetched or for others as well. What do you think would be appropriate? > MergeDB incorrectly handles unfetched CrawlDatums > - > > Key: NUTCH-2565 > URL: https://issues.apache.org/jira/browse/NUTCH-2565 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.14 >Reporter: Jurian Broertjes >Priority: Minor > > I ran into this issue when merging a crawlDB originating from sitemaps into > our normal crawlDB. CrawlDatums are merged based on output of > AbstractFetchSchedule::calculateLastFetchTime(). When CrawlDatums are > unfetched, this can overwrite fetchTime or other stuff. > I assume this is a bug and have a simple fix for it that checks if CrawlDatum > has status db_unfetched. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2565) MergeDB incorrectly handles unfetched CrawlDatums
[ https://issues.apache.org/jira/browse/NUTCH-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432083#comment-16432083 ] Jurian Broertjes commented on NUTCH-2565: - PR: https://github.com/apache/nutch/pull/311 > MergeDB incorrectly handles unfetched CrawlDatums > - > > Key: NUTCH-2565 > URL: https://issues.apache.org/jira/browse/NUTCH-2565 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.14 >Reporter: Jurian Broertjes >Priority: Minor > > I ran into this issue when merging a crawlDB originating from sitemaps into > our normal crawlDB. CrawlDatums are merged based on output of > AbstractFetchSchedule::calculateLastFetchTime(). When CrawlDatums are > unfetched, this can overwrite fetchTime or other stuff. > I assume this is a bug and have a simple fix for it that checks if CrawlDatum > has status db_unfetched. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (NUTCH-2565) MergeDB incorrectly handles unfetched CrawlDatums
Jurian Broertjes created NUTCH-2565: --- Summary: MergeDB incorrectly handles unfetched CrawlDatums Key: NUTCH-2565 URL: https://issues.apache.org/jira/browse/NUTCH-2565 Project: Nutch Issue Type: Bug Affects Versions: 1.14 Reporter: Jurian Broertjes I ran into this issue when merging a crawlDB originating from sitemaps into our normal crawlDB. CrawlDatums are merged based on output of AbstractFetchSchedule::calculateLastFetchTime(). When CrawlDatums are unfetched, this can overwrite fetchTime or other stuff. I assume this is a bug and have a simple fix for it that checks if CrawlDatum has status db_unfetched. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2543) readdb & readlinkdb to implement AbstractChecker
[ https://issues.apache.org/jira/browse/NUTCH-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409584#comment-16409584 ] Jurian Broertjes commented on NUTCH-2543: - PR: [https://github.com/apache/nutch/pull/303] PR also includes a fix for AbstractChecker when in keepClientCnxOpen mode that resulted in partial/corrupted results due to re-creating BufferedReader objects. > readdb & readlinkdb to implement AbstractChecker > > > Key: NUTCH-2543 > URL: https://issues.apache.org/jira/browse/NUTCH-2543 > Project: Nutch > Issue Type: Improvement > Components: crawldb, linkdb >Reporter: Jurian Broertjes >Priority: Minor > Labels: patch > > Implement AbstractChecker in LinkDbReader & CrawlDbReader classes, so we can > expose them via TCP. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
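To illustrate why re-creating BufferedReader objects on one connection can lose input, here is a small self-contained sketch. It is not the AbstractChecker code; `ReaderReuseDemo` and its two methods are hypothetical names, and an in-memory stream stands in for the client socket.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ReaderReuseDemo {
    // One reader for the whole connection: its buffer is kept between lines.
    static List<String> readWithOneReader(InputStream in) throws IOException {
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        List<String> lines = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            lines.add(line);
        }
        return lines;
    }

    // A fresh reader per line: each reader may buffer more than one line from
    // the stream, and whatever it buffered is discarded along with the reader.
    static List<String> readWithFreshReaders(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        while (true) {
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
            String line = reader.readLine();
            if (line == null) {
                break;
            }
            lines.add(line);
        }
        return lines;
    }
}
```

Feeding both methods the same three-line input shows the difference: the single reader returns every line, while the per-line readers drop the lines that a discarded reader had already pulled into its buffer.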
[jira] [Created] (NUTCH-2543) readdb & readlinkdb to implement AbstractChecker
Jurian Broertjes created NUTCH-2543: --- Summary: readdb & readlinkdb to implement AbstractChecker Key: NUTCH-2543 URL: https://issues.apache.org/jira/browse/NUTCH-2543 Project: Nutch Issue Type: Improvement Components: crawldb, linkdb Reporter: Jurian Broertjes Implement AbstractChecker in LinkDbReader & CrawlDbReader classes, so we can expose them via TCP. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2321) Indexing filter checker leaks threads
[ https://issues.apache.org/jira/browse/NUTCH-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316670#comment-16316670 ] Jurian Broertjes commented on NUTCH-2321: - Reworked patch, less messy. PR: https://github.com/apache/nutch/pull/272 > Indexing filter checker leaks threads > - > > Key: NUTCH-2321 > URL: https://issues.apache.org/jira/browse/NUTCH-2321 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.12 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.15 > > Attachments: NUTCH-2321.patch > > > Same issue as NUTCH-2320. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2382) indexer-hbase Nutch 1.x branch
[ https://issues.apache.org/jira/browse/NUTCH-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295034#comment-16295034 ] Jurian Broertjes commented on NUTCH-2382: - Yeah +1 for that. > indexer-hbase Nutch 1.x branch > -- > > Key: NUTCH-2382 > URL: https://issues.apache.org/jira/browse/NUTCH-2382 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 1.13 >Reporter: Jurian Broertjes > Fix For: 1.15 > > Attachments: NUTCH-2382-indexer-hbase-p1.patch > > > I've ported the indexer-hbase for Nutch 2.x > (https://github.com/apache/nutch/pull/184) to 1.x. Did some basic tests. > Patch is attached. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2431) URLFilterchecker to implement Tool-interface
[ https://issues.apache.org/jira/browse/NUTCH-2431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294783#comment-16294783 ] Jurian Broertjes commented on NUTCH-2431: - Yes, this is indeed resolved by NUTCH-2477 > URLFilterchecker to implement Tool-interface > > > Key: NUTCH-2431 > URL: https://issues.apache.org/jira/browse/NUTCH-2431 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.13 >Reporter: Jurian Broertjes >Priority: Minor > Labels: urlfilter > Attachments: NUTCH-2431.patch > > > The current implementation of the URLFilterChecker does not allow for > commandline config overrides. It needs to implement the Tool interface for > this. > Please see the attached patch -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2380) indexer-elastic version bump
[ https://issues.apache.org/jira/browse/NUTCH-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294778#comment-16294778 ] Jurian Broertjes commented on NUTCH-2380: - I've tested it a while back, and it's currently also running for a customer. I guess it should be fine for 1.14 > indexer-elastic version bump > > > Key: NUTCH-2380 > URL: https://issues.apache.org/jira/browse/NUTCH-2380 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 1.13 >Reporter: Jurian Broertjes >Priority: Minor > Fix For: 1.14 > > Attachments: NUTCH-2380-indexer-elastic-p0.patch > > > The current version of the indexer-elastic plugin is not compatible with ES > 5.x. The patch bumps the ES lib version to 5.3 but also requires a Nutch > classloader fix (NUTCH-2378) due to runtime dependency issues. > I didn't test compatibility with ES 2.x, so not sure if that still works. > Please let me know what you think of the provided patch. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2477) Refactor *Checker classes to use base class for common code
[ https://issues.apache.org/jira/browse/NUTCH-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16287823#comment-16287823 ] Jurian Broertjes commented on NUTCH-2477: - Feedback is welcome > Refactor *Checker classes to use base class for common code > --- > > Key: NUTCH-2477 > URL: https://issues.apache.org/jira/browse/NUTCH-2477 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.13 >Reporter: Jurian Broertjes >Priority: Minor > Labels: pull-request-available > > The various Checker class implementations have quite a bit of duplicated code > in them. This should be refactored for cleanliness and maintainability. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (NUTCH-2477) Refactor *Checker classes to use base class for common code
[ https://issues.apache.org/jira/browse/NUTCH-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jurian Broertjes updated NUTCH-2477: External issue URL: https://github.com/apache/nutch/pull/256 > Refactor *Checker classes to use base class for common code > --- > > Key: NUTCH-2477 > URL: https://issues.apache.org/jira/browse/NUTCH-2477 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.13 >Reporter: Jurian Broertjes >Priority: Minor > Labels: pull-request-available > > The various Checker class implementations have quite a bit of duplicated code > in them. This should be refactored for cleanliness and maintainability. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (NUTCH-2477) Refactor *Checker classes to use base class for common code
Jurian Broertjes created NUTCH-2477: --- Summary: Refactor *Checker classes to use base class for common code Key: NUTCH-2477 URL: https://issues.apache.org/jira/browse/NUTCH-2477 Project: Nutch Issue Type: Improvement Affects Versions: 1.13 Reporter: Jurian Broertjes Priority: Minor The various Checker class implementations have quite a bit of duplicated code in them. This should be refactored for cleanliness and maintainability. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2431) Filterchecker to implement Tool-interface
[ https://issues.apache.org/jira/browse/NUTCH-2431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241941#comment-16241941 ] Jurian Broertjes commented on NUTCH-2431: - Will have a look at your feedback the coming week > Filterchecker to implement Tool-interface > - > > Key: NUTCH-2431 > URL: https://issues.apache.org/jira/browse/NUTCH-2431 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.13 >Reporter: Jurian Broertjes >Priority: Minor > Labels: urlfilter > Attachments: NUTCH-2431.patch > > > The current implementation of the URLFilterChecker does not allow for > commandline config overrides. It needs to implement the Tool interface for > this. > Please see the attached patch -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (NUTCH-2431) Filterchecker to implement Tool-interface
[ https://issues.apache.org/jira/browse/NUTCH-2431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jurian Broertjes updated NUTCH-2431: Attachment: NUTCH-2431.patch > Filterchecker to implement Tool-interface > - > > Key: NUTCH-2431 > URL: https://issues.apache.org/jira/browse/NUTCH-2431 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.13 >Reporter: Jurian Broertjes >Priority: Minor > Labels: urlfilter > Attachments: NUTCH-2431.patch > > > The current implementation of the URLFilterChecker does not allow for > commandline config overrides. It needs to implement the Tool interface for > this. > Please see the attached patch -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (NUTCH-2431) Filterchecker to implement Tool-interface
Jurian Broertjes created NUTCH-2431: --- Summary: Filterchecker to implement Tool-interface Key: NUTCH-2431 URL: https://issues.apache.org/jira/browse/NUTCH-2431 Project: Nutch Issue Type: Improvement Affects Versions: 1.13 Reporter: Jurian Broertjes Priority: Minor The current implementation of the URLFilterChecker does not allow for commandline config overrides. It needs to implement the Tool interface for this. Please see the attached patch -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (NUTCH-2382) indexer-hbase Nutch 1.x branch
[ https://issues.apache.org/jira/browse/NUTCH-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jurian Broertjes updated NUTCH-2382: Attachment: NUTCH-2382-indexer-hbase-p1.patch > indexer-hbase Nutch 1.x branch > -- > > Key: NUTCH-2382 > URL: https://issues.apache.org/jira/browse/NUTCH-2382 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 1.13 >Reporter: Jurian Broertjes > Attachments: NUTCH-2382-indexer-hbase-p1.patch > > > I've ported the indexer-hbase for Nutch 2.x > (https://github.com/apache/nutch/pull/184) to 1.x. Did some basic tests. > Patch is attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Issue Comment Deleted] (NUTCH-2373) Indexer for Hbase
[ https://issues.apache.org/jira/browse/NUTCH-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jurian Broertjes updated NUTCH-2373: Comment: was deleted (was: Nutch 1.x version) > Indexer for Hbase > - > > Key: NUTCH-2373 > URL: https://issues.apache.org/jira/browse/NUTCH-2373 > Project: Nutch > Issue Type: New Feature > Components: indexer >Affects Versions: 2.3 >Reporter: Kaidul Islam >Assignee: Kaidul Islam > Fix For: 2.4 > > > Some use cases involve storing the documents in a database rather than in > an indexing search engine such as Solr or ElasticSearch. This is a plugin to > send the documents to Hbase storage. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (NUTCH-2373) Indexer for Hbase
[ https://issues.apache.org/jira/browse/NUTCH-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15992836#comment-15992836 ] Jurian Broertjes commented on NUTCH-2373: - Nutch 1.x version > Indexer for Hbase > - > > Key: NUTCH-2373 > URL: https://issues.apache.org/jira/browse/NUTCH-2373 > Project: Nutch > Issue Type: New Feature > Components: indexer >Affects Versions: 2.3 >Reporter: Kaidul Islam >Assignee: Kaidul Islam > Fix For: 2.4 > > > Some use cases involve storing the documents in a database rather than in > an indexing search engine such as Solr or ElasticSearch. This is a plugin to > send the documents to Hbase storage. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (NUTCH-2382) indexer-hbase Nutch 1.x branch
Jurian Broertjes created NUTCH-2382: --- Summary: indexer-hbase Nutch 1.x branch Key: NUTCH-2382 URL: https://issues.apache.org/jira/browse/NUTCH-2382 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.13 Reporter: Jurian Broertjes I've ported the indexer-hbase for Nutch 2.x (https://github.com/apache/nutch/pull/184) to 1.x. Did some basic tests. Patch is attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (NUTCH-2380) indexer-elastic version bump
[ https://issues.apache.org/jira/browse/NUTCH-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jurian Broertjes updated NUTCH-2380: Attachment: NUTCH-2380-indexer-elastic-p0.patch > indexer-elastic version bump > > > Key: NUTCH-2380 > URL: https://issues.apache.org/jira/browse/NUTCH-2380 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 1.13 >Reporter: Jurian Broertjes >Priority: Minor > Attachments: NUTCH-2380-indexer-elastic-p0.patch > > > The current version of the indexer-elastic plugin is not compatible with ES > 5.x. The patch bumps the ES lib version to 5.3 but also requires a Nutch > classloader fix (NUTCH-2378) due to runtime dependency issues. > I didn't test compatibility with ES 2.x, so not sure if that still works. > Please let me know what you think of the provided patch. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (NUTCH-2380) indexer-elastic version bump
Jurian Broertjes created NUTCH-2380: --- Summary: indexer-elastic version bump Key: NUTCH-2380 URL: https://issues.apache.org/jira/browse/NUTCH-2380 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.13 Reporter: Jurian Broertjes Priority: Minor The current version of the indexer-elastic plugin is not compatible with ES 5.x. The patch bumps the ES lib version to 5.3 but also requires a Nutch classloader fix (NUTCH-2378) due to runtime dependency issues. I didn't test compatibility with ES 2.x, so not sure if that still works. Please let me know what you think of the provided patch. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (NUTCH-2378) ChildFirst plugin classloader
[ https://issues.apache.org/jira/browse/NUTCH-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jurian Broertjes updated NUTCH-2378: Attachment: NUTCH-2378-childfirst-plugin-classloader.patch > ChildFirst plugin classloader > - > > Key: NUTCH-2378 > URL: https://issues.apache.org/jira/browse/NUTCH-2378 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.13 >Reporter: Jurian Broertjes > Attachments: NUTCH-2378-childfirst-plugin-classloader.patch > > > While working on upgrading the indexer-elastic plugin from 2.x to 5.x, I ran > into several nasty runtime dependency issues (both local and on Hadoop). > After seeking help on the mailing list, I still was unable to resolve these > issues and after digging further, decided to try a different plugin > classloader strategy. > The normal classloader delegates class loading requests to its parent > classloader. This can cause all sorts of nasty runtime dependency version > conflicts (jar hell, version conflicts), since the plugin's own classloader > gets queried last. The child-first classloader approach tries to load a class > from the plugin's dependencies first and when unavailable, delegates to its > parent classloader. This fixed the issues I had. > The new approach can give runtime LinkageErrors, but these are easily > resolvable (see the patch for a few examples) > I've tested the new loader a bit and am curious about others' findings. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (NUTCH-2378) ChildFirst plugin classloader
Jurian Broertjes created NUTCH-2378: --- Summary: ChildFirst plugin classloader Key: NUTCH-2378 URL: https://issues.apache.org/jira/browse/NUTCH-2378 Project: Nutch Issue Type: Improvement Affects Versions: 1.13 Reporter: Jurian Broertjes While working on upgrading the indexer-elastic plugin from 2.x to 5.x, I ran into several nasty runtime dependency issues (both local and on Hadoop). After seeking help on the mailing list, I still was unable to resolve these issues and after digging further, decided to try a different plugin classloader strategy. The normal classloader delegates class loading requests to its parent classloader. This can cause all sorts of nasty runtime dependency version conflicts (jar hell, version conflicts), since the plugin's own classloader gets queried last. The child-first classloader approach tries to load a class from the plugin's dependencies first and when unavailable, delegates to its parent classloader. This fixed the issues I had. The new approach can give runtime LinkageErrors, but these are easily resolvable (see the patch for a few examples) I've tested the new loader a bit and am curious about others' findings. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
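For readers unfamiliar with the strategy, a child-first loader can be sketched as below. This is a generic illustration built on `java.net.URLClassLoader`, not the attached patch; the class name `ChildFirstClassLoader` and the choice to always delegate `java.*` names to the parent are assumptions made for the example.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Illustrative child-first loader; not the class from the attached patch.
class ChildFirstClassLoader extends URLClassLoader {

    ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                if (name.startsWith("java.")) {
                    // Core classes must always come from the parent chain.
                    c = super.loadClass(name, false);
                } else {
                    try {
                        // Child first: look in this loader's own URLs.
                        c = findClass(name);
                    } catch (ClassNotFoundException e) {
                        // Not found locally: fall back to normal delegation.
                        c = super.loadClass(name, false);
                    }
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }
}
```

The trade-off matches the description above: a class that is present both in the plugin's jars and in the parent is now defined by the plugin's loader, which avoids version conflicts but can surface as LinkageErrors when the two definitions meet across a shared interface.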
[jira] [Commented] (NUTCH-2242) lastModified not always set
[ https://issues.apache.org/jira/browse/NUTCH-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15279842#comment-15279842 ] Jurian Broertjes commented on NUTCH-2242: - Hi Sebastian, I've put this in the reduce() function because that is where a generic modified/not-modified check is done. I think it would make sense to do setModifiedTime() there, together with setSignature(). The one in DefaultFetchSchedule is only for setting the modified time on the first successful fetch. > lastModified not always set > --- > > Key: NUTCH-2242 > URL: https://issues.apache.org/jira/browse/NUTCH-2242 > Project: Nutch > Issue Type: Bug > Components: crawldb >Affects Versions: 1.11 >Reporter: Jurian Broertjes >Priority: Minor > Fix For: 1.12 > > Attachments: NUTCH-2242.patch > > > I observed two issues: > - When using the DefaultFetchSchedule, CrawlDatum's modifiedTime field is not > updated on the first successful fetch. > - When a document modification is detected (protocol- or signature-wise), the > modifiedTime isn't updated > I can provide a patch later today. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2242) lastModified not always set
[ https://issues.apache.org/jira/browse/NUTCH-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jurian Broertjes updated NUTCH-2242: Flags: Patch Patch Info: Patch Available > lastModified not always set > --- > > Key: NUTCH-2242 > URL: https://issues.apache.org/jira/browse/NUTCH-2242 > Project: Nutch > Issue Type: Bug > Components: crawldb >Affects Versions: 1.11 >Reporter: Jurian Broertjes >Priority: Minor > Attachments: NUTCH-2242.patch > > > I observed two issues: > - When using the DefaultFetchSchedule, CrawlDatum's modifiedTime field is not > updated on the first successful fetch. > - When a document modification is detected (protocol- or signature-wise), the > modifiedTime isn't updated > I can provide a patch later today. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2242) lastModified not always set
[ https://issues.apache.org/jira/browse/NUTCH-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jurian Broertjes updated NUTCH-2242: Attachment: NUTCH-2242.patch Initial version of patch. Please review > lastModified not always set > --- > > Key: NUTCH-2242 > URL: https://issues.apache.org/jira/browse/NUTCH-2242 > Project: Nutch > Issue Type: Bug > Components: crawldb >Affects Versions: 1.11 >Reporter: Jurian Broertjes >Priority: Minor > Attachments: NUTCH-2242.patch > > > I observed two issues: > - When using the DefaultFetchSchedule, CrawlDatum's modifiedTime field is not > updated on the first successful fetch. > - When a document modification is detected (protocol- or signature-wise), the > modifiedTime isn't updated > I can provide a patch later today. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2242) lastModified not always set
Jurian Broertjes created NUTCH-2242: --- Summary: lastModified not always set Key: NUTCH-2242 URL: https://issues.apache.org/jira/browse/NUTCH-2242 Project: Nutch Issue Type: Bug Components: crawldb Affects Versions: 1.11 Reporter: Jurian Broertjes Priority: Minor I observed two issues: - When using the DefaultFetchSchedule, CrawlDatum's modifiedTime field is not updated on the first successful fetch. - When a document modification is detected (protocol- or signature-wise), the modifiedTime isn't updated I can provide a patch later today. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
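The two cases in NUTCH-2242 (first successful fetch, and a later change detected via the signature) can be illustrated with a small self-contained sketch. This is not the attached patch and does not use Nutch's CrawlDatum API: PageRecord and ModifiedTimeUpdater are hypothetical stand-ins, and only the signature-based check is shown, not the protocol-level one.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a crawl record; not Nutch's CrawlDatum. */
class PageRecord {
  byte[] signature;   // null until the first successful fetch
  long modifiedTime;  // 0 until a modification has been recorded
}

/** Hypothetical helper showing where the modified time gets set. */
class ModifiedTimeUpdater {

  /**
   * Compares the stored signature with the freshly fetched one. When they
   * differ (including the first fetch, where nothing is stored yet), the
   * signature and the modified time are updated together.
   *
   * @return true if the page was treated as modified
   */
  static boolean update(PageRecord stored, byte[] fetchedSignature, long fetchTime) {
    boolean modified = !Arrays.equals(stored.signature, fetchedSignature);
    if (modified) {
      stored.signature = fetchedSignature;
      stored.modifiedTime = fetchTime;
    }
    return modified;
  }
}
```

Updating both fields in the same branch keeps them from drifting apart, which is the behaviour the issue reports as missing.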
[jira] [Updated] (NUTCH-2203) Suffix URL filter can't handle trailing/leading whitespaces
[ https://issues.apache.org/jira/browse/NUTCH-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jurian Broertjes updated NUTCH-2203: Attachment: NUTCH-2203.patch Attached a patch to fix this. > Suffix URL filter can't handle trailing/leading whitespaces > --- > > Key: NUTCH-2203 > URL: https://issues.apache.org/jira/browse/NUTCH-2203 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.11 >Reporter: Jurian Broertjes >Priority: Trivial > Attachments: NUTCH-2203.patch > > > I ran into an issue where some lines in suffix-urlfilter.txt contained > trailing whitespace, which caused the filtering to misbehave. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2203) Suffix URL filter can't handle trailing/leading whitespaces
Jurian Broertjes created NUTCH-2203: --- Summary: Suffix URL filter can't handle trailing/leading whitespaces Key: NUTCH-2203 URL: https://issues.apache.org/jira/browse/NUTCH-2203 Project: Nutch Issue Type: Bug Affects Versions: 1.11 Reporter: Jurian Broertjes Priority: Trivial I ran into an issue where some lines in suffix-urlfilter.txt contained trailing whitespace, which caused the filtering to misbehave. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
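A minimal sketch of the kind of fix NUTCH-2203 calls for: trim each line of the rules file before using it as a suffix. This is not the attached patch. SuffixRuleReader and readSuffixes are hypothetical names, and skipping blank lines and lines starting with '#' is an assumption about the file format.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader for a suffix rules file; not Nutch's own code. */
class SuffixRuleReader {

  /** Returns the suffixes from the rules, with surrounding whitespace removed. */
  static List<String> readSuffixes(Reader rules) throws IOException {
    List<String> suffixes = new ArrayList<>();
    BufferedReader in = new BufferedReader(rules);
    String line;
    while ((line = in.readLine()) != null) {
      // Strip leading and trailing whitespace so ".gif " matches like ".gif".
      String suffix = line.trim();
      // Assumption: blank lines and '#' comment lines carry no rule.
      if (suffix.isEmpty() || suffix.startsWith("#")) {
        continue;
      }
      suffixes.add(suffix);
    }
    return suffixes;
  }
}
```

Without the trim, a rule such as ".gif " (with a trailing space) would only match URLs that themselves end in a space, so the rule would silently never apply.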
[jira] [Updated] (NUTCH-2197) Add solr5 solrcloud indexer support
[ https://issues.apache.org/jira/browse/NUTCH-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jurian Broertjes updated NUTCH-2197: Attachment: NUTCH-2197.patch I've attached a patch with initial support for Solr5 + SolrCloud. Please review it. > Add solr5 solrcloud indexer support > --- > > Key: NUTCH-2197 > URL: https://issues.apache.org/jira/browse/NUTCH-2197 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 1.12 >Reporter: Jurian Broertjes >Priority: Minor > Attachments: NUTCH-2197.patch > > > Nutch cannot index to Solr5. Also proper SolrCloud support is missing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)