[jira] [Updated] (NUTCH-2750) improve CrawlDbReader & LinkDbReader reader handling

2019-10-24 Thread Jurian Broertjes (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jurian Broertjes updated NUTCH-2750:

Description: 
The current implementation in the CrawlDbReader re-opens readers for every URL. 
This is not very efficient. I've implemented a modification time check that 
only re-opens readers on updated crawlDB.

 PR: https://github.com/apache/nutch/pull/483

  was:
The current implementation in the CrawlDbReader re-opens readers for every URL. 
This is not very efficient. I've implemented a modification time check that 
only re-opens readers on updated crawlDB.

 


> improve CrawlDbReader & LinkDbReader reader handling
> 
>
> Key: NUTCH-2750
> URL: https://issues.apache.org/jira/browse/NUTCH-2750
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb, linkdb
>Affects Versions: 1.16
>Reporter: Jurian Broertjes
>Priority: Minor
>
> The current implementation in the CrawlDbReader re-opens readers for every 
> URL. This is not very efficient. I've implemented a modification time check 
> that only re-opens readers on updated crawlDB.
>  PR: https://github.com/apache/nutch/pull/483



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (NUTCH-2750) improve CrawlDbReader & LinkDbReader reader handling

2019-10-24 Thread Jurian Broertjes (Jira)
Jurian Broertjes created NUTCH-2750:
---

 Summary: improve CrawlDbReader & LinkDbReader reader handling
 Key: NUTCH-2750
 URL: https://issues.apache.org/jira/browse/NUTCH-2750
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb, linkdb
Affects Versions: 1.16
Reporter: Jurian Broertjes


The current implementation in the CrawlDbReader re-opens readers for every URL. 
This is not very efficient. I've implemented a modification time check that 
only re-opens readers on updated crawlDB.
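
The modification time check described above could look roughly like the sketch below. This is only an illustration of the idea, not the code from the PR: the class name ReopenOnChange, the opener callback and the openCount() accessor are hypothetical, and it watches a plain filesystem path via java.nio rather than the Hadoop FileSystem API that Nutch uses.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.function.Function;

/** Hypothetical helper: caches a reader and re-opens it only when the path's mtime changes. */
class ReopenOnChange<R> {
  private final Path dbPath;
  private final Function<Path, R> opener; // assumed callback that opens a reader for the path
  private FileTime lastModified;          // mtime observed when the reader was last opened
  private R reader;
  private int openCount;                  // number of (re-)opens, kept for illustration only

  ReopenOnChange(Path dbPath, Function<Path, R> opener) {
    this.dbPath = dbPath;
    this.opener = opener;
  }

  /** Returns the cached reader, re-opening it first if the path was modified since the last open. */
  synchronized R get() throws IOException {
    FileTime current = Files.getLastModifiedTime(dbPath);
    if (reader == null || !current.equals(lastModified)) {
      reader = opener.apply(dbPath);
      lastModified = current;
      openCount++;
    }
    return reader;
  }

  synchronized int openCount() {
    return openCount;
  }
}
```

With something like this, repeated lookups against an unchanged crawlDB share one reader instead of opening a new one per URL. A real implementation would also have to close the old readers before replacing them.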

 





[jira] [Updated] (NUTCH-2717) Generator cannot open hostDB

2019-05-16 Thread Jurian Broertjes (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jurian Broertjes updated NUTCH-2717:

Description: 
During generate, the hostDB cannot be opened anymore, see:
{quote}2019-05-16 15:53:50,134 ERROR crawl.Generator - Error reading HostDB 
because File file:/hostdb/current/part-r-0/data does not exist
{quote}
PR: https://github.com/apache/nutch/pull/455

  was:
During generate, the hostDB cannot be opened anymore, see:
{quote}2019-05-16 15:53:50,134 ERROR crawl.Generator - Error reading HostDB 
because File file:/hostdb/current/part-r-0/data does not exist
{quote}
I will create a PR for the fix


> Generator cannot open hostDB
> 
>
> Key: NUTCH-2717
> URL: https://issues.apache.org/jira/browse/NUTCH-2717
> Project: Nutch
>  Issue Type: Bug
>  Components: generator, hostdb
>Affects Versions: 1.15
>Reporter: Jurian Broertjes
>Priority: Minor
>
> During generate, the hostDB cannot be opened anymore, see:
> {quote}2019-05-16 15:53:50,134 ERROR crawl.Generator - Error reading HostDB 
> because File file:/hostdb/current/part-r-0/data does not exist
> {quote}
> PR: https://github.com/apache/nutch/pull/455



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (NUTCH-2717) Generator cannot open hostDB

2019-05-16 Thread Jurian Broertjes (JIRA)
Jurian Broertjes created NUTCH-2717:
---

 Summary: Generator cannot open hostDB
 Key: NUTCH-2717
 URL: https://issues.apache.org/jira/browse/NUTCH-2717
 Project: Nutch
  Issue Type: Bug
  Components: generator, hostdb
Affects Versions: 1.15
Reporter: Jurian Broertjes


During generate, the hostDB cannot be opened anymore, see:
{quote}2019-05-16 15:53:50,134 ERROR crawl.Generator - Error reading HostDB 
because File file:/hostdb/current/part-r-0/data does not exist
{quote}
I will create a PR for the fix





[jira] [Updated] (NUTCH-2525) Metadata indexer cannot handle uppercase parse metadata

2019-05-07 Thread Jurian Broertjes (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jurian Broertjes updated NUTCH-2525:

Attachment: NUTCH-2525-p1.patch

> Metadata indexer cannot handle uppercase parse metadata
> ---
>
> Key: NUTCH-2525
> URL: https://issues.apache.org/jira/browse/NUTCH-2525
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.16
>
> Attachments: NUTCH-2525-p1.patch, NUTCH-2525.patch
>
>
> MetadataIndexer lowercases keys for parse metadata, making it impossible to 
> index metadata containing uppercase. 





[jira] [Commented] (NUTCH-2525) Metadata indexer cannot handle uppercase parse metadata

2019-05-07 Thread Jurian Broertjes (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834822#comment-16834822
 ] 

Jurian Broertjes commented on NUTCH-2525:
-

Updated patch so it applies against master

> Metadata indexer cannot handle uppercase parse metadata
> ---
>
> Key: NUTCH-2525
> URL: https://issues.apache.org/jira/browse/NUTCH-2525
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.16
>
> Attachments: NUTCH-2525-p1.patch, NUTCH-2525.patch
>
>
> MetadataIndexer lowercases keys for parse metadata, making it impossible to 
> index metadata containing uppercase. 





[jira] [Commented] (NUTCH-2565) MergeDB incorrectly handles unfetched CrawlDatums

2018-06-14 Thread Jurian Broertjes (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512589#comment-16512589
 ] 

Jurian Broertjes commented on NUTCH-2565:
-

Updated PR with the proposed solution

> MergeDB incorrectly handles unfetched CrawlDatums
> -
>
> Key: NUTCH-2565
> URL: https://issues.apache.org/jira/browse/NUTCH-2565
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Jurian Broertjes
>Priority: Minor
>
> I ran into this issue when merging a crawlDB originating from sitemaps into 
> our normal crawlDB. CrawlDatums are merged based on output of 
> AbstractFetchSchedule::calculateLastFetchTime(). When CrawlDatums are 
> unfetched, this can overwrite fetchTime or other stuff.
> I assume this is a bug and have a simple fix for it that checks if CrawlDatum 
> has status db_unfetched.
>  





[jira] [Commented] (NUTCH-2597) NPE in updatehostdb

2018-06-13 Thread Jurian Broertjes (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511082#comment-16511082
 ] 

Jurian Broertjes commented on NUTCH-2597:
-

PR: [https://github.com/apache/nutch/pull/349]

Fixes cleanup(), also for indexer/CleaningJob.java

> NPE in updatehostdb
> ---
>
> Key: NUTCH-2597
> URL: https://issues.apache.org/jira/browse/NUTCH-2597
> Project: Nutch
>  Issue Type: Bug
>  Components: hostdb
>Affects Versions: 1.15
>Reporter: Jurian Broertjes
>Priority: Critical
>
> I get an NPE on updatehostdb. I start with a clean crawlDB & hostDB. After an 
> inject, I do an updatehostdb with -checkAll and get the following stacktrace:
> {code}
> 2018-06-13 10:45:21,958 WARN hostdb.ResolverThread - java.lang.NullPointerException
>  at org.apache.hadoop.io.SequenceFile$Writer.checkAndWriteSync(SequenceFile.java:1359)
>  at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1400)
>  at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:83)
>  at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:558)
>  at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
>  at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
>  at org.apache.nutch.hostdb.ResolverThread.run(ResolverThread.java:82)
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> {code}
> Is this related to NUTCH-2375?
> If further testing is needed, please let me know!





[jira] [Created] (NUTCH-2597) NPE in updatehostdb

2018-06-13 Thread Jurian Broertjes (JIRA)
Jurian Broertjes created NUTCH-2597:
---

 Summary: NPE in updatehostdb
 Key: NUTCH-2597
 URL: https://issues.apache.org/jira/browse/NUTCH-2597
 Project: Nutch
  Issue Type: Bug
  Components: hostdb
Affects Versions: 1.15
Reporter: Jurian Broertjes


I get an NPE on updatehostdb. I start with a clean crawlDB & hostDB. After an 
inject, I do an updatehostdb with -checkAll and get the following stacktrace:

2018-06-13 10:45:21,958 WARN hostdb.ResolverThread - java.lang.NullPointerException
 at org.apache.hadoop.io.SequenceFile$Writer.checkAndWriteSync(SequenceFile.java:1359)
 at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1400)
 at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:83)
 at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:558)
 at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
 at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
 at org.apache.nutch.hostdb.ResolverThread.run(ResolverThread.java:82)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)

Is this related to NUTCH-2375?

If further testing is needed, please let me know!





[jira] [Commented] (NUTCH-2565) MergeDB incorrectly handles unfetched CrawlDatums

2018-06-12 Thread Jurian Broertjes (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509801#comment-16509801
 ] 

Jurian Broertjes commented on NUTCH-2565:
-

Maybe it would be sufficient to only test on STATUS_DB_UNFETCHED in 
calculateLastFetchTime(datum), but fallback on CrawlDatum.getFetchTime() in the 
merger and pick the newest according to that.

That way we could also just pick the retries value from the newest one and keep 
it simple.

I'll add a PR later for review
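
To make the proposal above a bit more concrete, here is a rough sketch of one possible reading of it. The Datum class is a hypothetical, simplified stand-in and not Nutch's CrawlDatum; the status constants, the field names and the exact last-fetch-time formula are assumptions for illustration only.

```java
/** Hypothetical, simplified stand-in for a crawlDB entry; not Nutch's CrawlDatum. */
final class Datum {
  static final int UNFETCHED = 1; // assumed status codes, for illustration only
  static final int FETCHED = 2;

  final int status;
  final long fetchTime;       // scheduled fetch time in milliseconds
  final int fetchIntervalSec; // fetch interval in seconds
  final int retries;

  Datum(int status, long fetchTime, int fetchIntervalSec, int retries) {
    this.status = status;
    this.fetchTime = fetchTime;
    this.fetchIntervalSec = fetchIntervalSec;
    this.retries = retries;
  }
}

final class MergeSketch {
  /** Last fetch time; an unfetched datum has never been fetched, so it gets 0. */
  static long lastFetchTime(Datum d) {
    if (d.status == Datum.UNFETCHED) {
      return 0L;
    }
    return d.fetchTime - d.fetchIntervalSec * 1000L;
  }

  /** Picks the newest datum; its retries value comes along with it. */
  static Datum pickNewest(Datum a, Datum b) {
    if (a.status == Datum.UNFETCHED && b.status == Datum.UNFETCHED) {
      // No last fetch time to compare: fall back on the fetch time itself.
      return b.fetchTime > a.fetchTime ? b : a;
    }
    return lastFetchTime(b) > lastFetchTime(a) ? b : a;
  }
}
```

In this reading a fetched datum always wins over an unfetched one, and between two unfetched datums the one with the later fetch time is kept, including its retries counter.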

> MergeDB incorrectly handles unfetched CrawlDatums
> -
>
> Key: NUTCH-2565
> URL: https://issues.apache.org/jira/browse/NUTCH-2565
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Jurian Broertjes
>Priority: Minor
>
> I ran into this issue when merging a crawlDB originating from sitemaps into 
> our normal crawlDB. CrawlDatums are merged based on output of 
> AbstractFetchSchedule::calculateLastFetchTime(). When CrawlDatums are 
> unfetched, this can overwrite fetchTime or other stuff.
> I assume this is a bug and have a simple fix for it that checks if CrawlDatum 
> has status db_unfetched.
>  





[jira] [Commented] (NUTCH-2012) Merge parsechecker and indexchecker

2018-06-12 Thread Jurian Broertjes (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509686#comment-16509686
 ] 

Jurian Broertjes commented on NUTCH-2012:
-

It looks like the process() function still uses System.out.println for output, 
instead of the output StringBuilder. I can supply a small PR to fix it.

> Merge parsechecker and indexchecker
> ---
>
> Key: NUTCH-2012
> URL: https://issues.apache.org/jira/browse/NUTCH-2012
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.10
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.15
>
>
> ParserChecker and IndexingFiltersChecker have evolved from simple tools to 
> check parsers and parsefilters resp. indexing filters to powerful tools which 
> emulate the crawling of a single URL/document:
> - check robots.txt (NUTCH-2002)
> - follow redirects (NUTCH-2004)
> Keeping both tools in sync takes extra work (cf. NUTCH-1757/NUTCH-2006, also 
> NUTCH-2002, NUTCH-2004 are done only for parsechecker). It's time to merge 
> them
> * either into one general debugging tool, keeping parsechecker and 
> indexchecker as aliases
> * centralize common code in one utility class





[jira] [Commented] (NUTCH-2565) MergeDB incorrectly handles unfetched CrawlDatums

2018-06-12 Thread Jurian Broertjes (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509573#comment-16509573
 ] 

Jurian Broertjes commented on NUTCH-2565:
-

One solution would be to sum the retries of both CrawlDatums. We could do this 
only for db_unfetched or for others as well. What do you think would be 
appropriate?

 

 

> MergeDB incorrectly handles unfetched CrawlDatums
> -
>
> Key: NUTCH-2565
> URL: https://issues.apache.org/jira/browse/NUTCH-2565
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Jurian Broertjes
>Priority: Minor
>
> I ran into this issue when merging a crawlDB originating from sitemaps into 
> our normal crawlDB. CrawlDatums are merged based on output of 
> AbstractFetchSchedule::calculateLastFetchTime(). When CrawlDatums are 
> unfetched, this can overwrite fetchTime or other stuff.
> I assume this is a bug and have a simple fix for it that checks if CrawlDatum 
> has status db_unfetched.
>  





[jira] [Commented] (NUTCH-2565) MergeDB incorrectly handles unfetched CrawlDatums

2018-04-10 Thread Jurian Broertjes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432083#comment-16432083
 ] 

Jurian Broertjes commented on NUTCH-2565:
-

PR: https://github.com/apache/nutch/pull/311

> MergeDB incorrectly handles unfetched CrawlDatums
> -
>
> Key: NUTCH-2565
> URL: https://issues.apache.org/jira/browse/NUTCH-2565
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Jurian Broertjes
>Priority: Minor
>
> I ran into this issue when merging a crawlDB originating from sitemaps into 
> our normal crawlDB. CrawlDatums are merged based on output of 
> AbstractFetchSchedule::calculateLastFetchTime(). When CrawlDatums are 
> unfetched, this can overwrite fetchTime or other stuff.
> I assume this is a bug and have a simple fix for it that checks if CrawlDatum 
> has status db_unfetched.
>  





[jira] [Created] (NUTCH-2565) MergeDB incorrectly handles unfetched CrawlDatums

2018-04-10 Thread Jurian Broertjes (JIRA)
Jurian Broertjes created NUTCH-2565:
---

 Summary: MergeDB incorrectly handles unfetched CrawlDatums
 Key: NUTCH-2565
 URL: https://issues.apache.org/jira/browse/NUTCH-2565
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.14
Reporter: Jurian Broertjes


I ran into this issue when merging a crawlDB originating from sitemaps into our 
normal crawlDB. CrawlDatums are merged based on output of 
AbstractFetchSchedule::calculateLastFetchTime(). When CrawlDatums are 
unfetched, this can overwrite fetchTime or other stuff.

I assume this is a bug and have a simple fix for it that checks if CrawlDatum 
has status db_unfetched.

 





[jira] [Commented] (NUTCH-2543) readdb & readlinkdb to implement AbstractChecker

2018-03-22 Thread Jurian Broertjes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409584#comment-16409584
 ] 

Jurian Broertjes commented on NUTCH-2543:
-

PR: [https://github.com/apache/nutch/pull/303]

PR also includes a fix for AbstractChecker when in keepClientCnxOpen mode that 
resulted in partial/corrupted results due to re-creating BufferedReader objects.
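
For anyone wondering why re-creating BufferedReader objects loses data, a small standalone demonstration follows. It is not the AbstractChecker code; the class and method names are made up, and a ByteArrayInputStream stands in for the client socket stream. Each BufferedReader reads ahead into its own buffer, so discarding the reader also discards input that was already consumed from the underlying stream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo class, not part of Nutch. */
class ReaderReuseDemo {

  /** Reads all lines through one BufferedReader that lives as long as the connection. */
  static List<String> readReusing(InputStream in) throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    List<String> lines = new ArrayList<>();
    String line;
    while ((line = reader.readLine()) != null) {
      lines.add(line);
    }
    return lines;
  }

  /** Creates a new BufferedReader per line; read-ahead data buffered by the old reader is lost. */
  static List<String> readRecreating(InputStream in) throws IOException {
    List<String> lines = new ArrayList<>();
    while (true) {
      BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
      String line = reader.readLine();
      if (line == null) {
        break;
      }
      lines.add(line);
    }
    return lines;
  }

  static InputStream sample() {
    return new ByteArrayInputStream("url1\nurl2\nurl3\n".getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) throws IOException {
    System.out.println("reused reader:     " + readReusing(sample()));
    System.out.println("re-created reader: " + readRecreating(sample()));
  }
}
```

Keeping a single reader per open connection avoids the problem, since the read-ahead buffer then stays with the connection.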

> readdb & readlinkdb to implement AbstractChecker
> 
>
> Key: NUTCH-2543
> URL: https://issues.apache.org/jira/browse/NUTCH-2543
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb, linkdb
>Reporter: Jurian Broertjes
>Priority: Minor
>  Labels: patch
>
> Implement AbstractChecker in LinkDbReader & CrawlDbReader classes, so we can 
> expose them via TCP.





[jira] [Created] (NUTCH-2543) readdb & readlinkdb to implement AbstractChecker

2018-03-22 Thread Jurian Broertjes (JIRA)
Jurian Broertjes created NUTCH-2543:
---

 Summary: readdb & readlinkdb to implement AbstractChecker
 Key: NUTCH-2543
 URL: https://issues.apache.org/jira/browse/NUTCH-2543
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb, linkdb
Reporter: Jurian Broertjes


Implement AbstractChecker in LinkDbReader & CrawlDbReader classes, so we can 
expose them via TCP.





[jira] [Commented] (NUTCH-2321) Indexing filter checker leaks threads

2018-01-08 Thread Jurian Broertjes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316670#comment-16316670
 ] 

Jurian Broertjes commented on NUTCH-2321:
-

Reworked patch, less messy. PR: https://github.com/apache/nutch/pull/272

> Indexing filter checker leaks threads
> -
>
> Key: NUTCH-2321
> URL: https://issues.apache.org/jira/browse/NUTCH-2321
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-2321.patch
>
>
> Same issue as NUTCH-2320.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2382) indexer-hbase Nutch 1.x branch

2017-12-18 Thread Jurian Broertjes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295034#comment-16295034
 ] 

Jurian Broertjes commented on NUTCH-2382:
-

Yeah +1 for that.

> indexer-hbase Nutch 1.x branch
> --
>
> Key: NUTCH-2382
> URL: https://issues.apache.org/jira/browse/NUTCH-2382
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
> Fix For: 1.15
>
> Attachments: NUTCH-2382-indexer-hbase-p1.patch
>
>
> I've ported the indexer-hbase for Nutch 2.x 
> (https://github.com/apache/nutch/pull/184) to 1.x. Did some basic tests. 
> Patch is attached.





[jira] [Commented] (NUTCH-2431) URLFilterchecker to implement Tool-interface

2017-12-18 Thread Jurian Broertjes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294783#comment-16294783
 ] 

Jurian Broertjes commented on NUTCH-2431:
-

Yes, this is indeed resolved by NUTCH-2477

> URLFilterchecker to implement Tool-interface
> 
>
> Key: NUTCH-2431
> URL: https://issues.apache.org/jira/browse/NUTCH-2431
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Minor
>  Labels: urlfilter
> Attachments: NUTCH-2431.patch
>
>
> The current implementation of the URLFilterChecker does not allow for 
> commandline config overrides. It needs to implement the Tool interface for 
> this. 
> Please see the attached patch





[jira] [Commented] (NUTCH-2380) indexer-elastic version bump

2017-12-18 Thread Jurian Broertjes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294778#comment-16294778
 ] 

Jurian Broertjes commented on NUTCH-2380:
-

I've tested it a while back, and it's currently also running for a customer. I 
guess it should be fine for 1.14

> indexer-elastic version bump
> 
>
> Key: NUTCH-2380
> URL: https://issues.apache.org/jira/browse/NUTCH-2380
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Minor
> Fix For: 1.14
>
> Attachments: NUTCH-2380-indexer-elastic-p0.patch
>
>
> The current version of the indexer-elastic plugin is not compatible with ES 
> 5.x. The patch bumps the ES lib version to 5.3 but also requires a Nutch 
> classloader fix (NUTCH-2378) due to runtime dependency issues. 
> I didn't test compatibility with ES 2.x, so not sure if that still works.
> Please let me know what you think of the provided patch.





[jira] [Commented] (NUTCH-2477) Refactor *Checker classes to use base class for common code

2017-12-12 Thread Jurian Broertjes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16287823#comment-16287823
 ] 

Jurian Broertjes commented on NUTCH-2477:
-

Feedback is welcome

> Refactor *Checker classes to use base class for common code
> ---
>
> Key: NUTCH-2477
> URL: https://issues.apache.org/jira/browse/NUTCH-2477
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Minor
>  Labels: pull-request-available
>
> The various Checker class implementations have quite a bit of duplicated code 
> in them. This should be refactored for cleanliness and maintainability.





[jira] [Updated] (NUTCH-2477) Refactor *Checker classes to use base class for common code

2017-12-12 Thread Jurian Broertjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jurian Broertjes updated NUTCH-2477:

External issue URL: https://github.com/apache/nutch/pull/256

> Refactor *Checker classes to use base class for common code
> ---
>
> Key: NUTCH-2477
> URL: https://issues.apache.org/jira/browse/NUTCH-2477
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Minor
>  Labels: pull-request-available
>
> The various Checker class implementations have quite a bit of duplicated code 
> in them. This should be refactored for cleanliness and maintainability.





[jira] [Created] (NUTCH-2477) Refactor *Checker classes to use base class for common code

2017-12-12 Thread Jurian Broertjes (JIRA)
Jurian Broertjes created NUTCH-2477:
---

 Summary: Refactor *Checker classes to use base class for common 
code
 Key: NUTCH-2477
 URL: https://issues.apache.org/jira/browse/NUTCH-2477
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.13
Reporter: Jurian Broertjes
Priority: Minor


The various Checker class implementations have quite a bit of duplicated code 
in them. This should be refactored for cleanliness and maintainability.





[jira] [Commented] (NUTCH-2431) Filterchecker to implement Tool-interface

2017-11-07 Thread Jurian Broertjes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241941#comment-16241941
 ] 

Jurian Broertjes commented on NUTCH-2431:
-

Will have a look at your feedback the coming week

> Filterchecker to implement Tool-interface
> -
>
> Key: NUTCH-2431
> URL: https://issues.apache.org/jira/browse/NUTCH-2431
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Minor
>  Labels: urlfilter
> Attachments: NUTCH-2431.patch
>
>
> The current implementation of the URLFilterChecker does not allow for 
> commandline config overrides. It needs to implement the Tool interface for 
> this. 
> Please see the attached patch





[jira] [Updated] (NUTCH-2431) Filterchecker to implement Tool-interface

2017-09-25 Thread Jurian Broertjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jurian Broertjes updated NUTCH-2431:

Attachment: NUTCH-2431.patch

> Filterchecker to implement Tool-interface
> -
>
> Key: NUTCH-2431
> URL: https://issues.apache.org/jira/browse/NUTCH-2431
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Minor
>  Labels: urlfilter
> Attachments: NUTCH-2431.patch
>
>
> The current implementation of the URLFilterChecker does not allow for 
> commandline config overrides. It needs to implement the Tool interface for 
> this. 
> Please see the attached patch



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (NUTCH-2431) Filterchecker to implement Tool-interface

2017-09-25 Thread Jurian Broertjes (JIRA)
Jurian Broertjes created NUTCH-2431:
---

 Summary: Filterchecker to implement Tool-interface
 Key: NUTCH-2431
 URL: https://issues.apache.org/jira/browse/NUTCH-2431
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.13
Reporter: Jurian Broertjes
Priority: Minor


The current implementation of the URLFilterChecker does not allow for 
commandline config overrides. It needs to implement the Tool interface for 
this. 

Please see the attached patch





[jira] [Updated] (NUTCH-2382) indexer-hbase Nutch 1.x branch

2017-05-02 Thread Jurian Broertjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jurian Broertjes updated NUTCH-2382:

Attachment: NUTCH-2382-indexer-hbase-p1.patch

> indexer-hbase Nutch 1.x branch
> --
>
> Key: NUTCH-2382
> URL: https://issues.apache.org/jira/browse/NUTCH-2382
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
> Attachments: NUTCH-2382-indexer-hbase-p1.patch
>
>
> I've ported the indexer-hbase for Nutch 2.x 
> (https://github.com/apache/nutch/pull/184) to 1.x. Did some basic tests. 
> Patch is attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Issue Comment Deleted] (NUTCH-2373) Indexer for Hbase

2017-05-02 Thread Jurian Broertjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jurian Broertjes updated NUTCH-2373:

Comment: was deleted

(was: Nutch 1.x version)

> Indexer for Hbase
> -
>
> Key: NUTCH-2373
> URL: https://issues.apache.org/jira/browse/NUTCH-2373
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Affects Versions: 2.3
>Reporter: Kaidul Islam
>Assignee: Kaidul Islam
> Fix For: 2.4
>
>
> Some use cases involve storing the documents in a database rather than in an 
> indexing search engine (e.g. Solr, ElasticSearch). This is a plugin to send the 
> documents to Hbase storage.





[jira] [Commented] (NUTCH-2373) Indexer for Hbase

2017-05-02 Thread Jurian Broertjes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15992836#comment-15992836
 ] 

Jurian Broertjes commented on NUTCH-2373:
-

Nutch 1.x version

> Indexer for Hbase
> -
>
> Key: NUTCH-2373
> URL: https://issues.apache.org/jira/browse/NUTCH-2373
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Affects Versions: 2.3
>Reporter: Kaidul Islam
>Assignee: Kaidul Islam
> Fix For: 2.4
>
>
> Some use cases involve storing the documents in a database rather than in an 
> indexing search engine (e.g. Solr, ElasticSearch). This is a plugin to send the 
> documents to Hbase storage.





[jira] [Created] (NUTCH-2382) indexer-hbase Nutch 1.x branch

2017-05-02 Thread Jurian Broertjes (JIRA)
Jurian Broertjes created NUTCH-2382:
---

 Summary: indexer-hbase Nutch 1.x branch
 Key: NUTCH-2382
 URL: https://issues.apache.org/jira/browse/NUTCH-2382
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.13
Reporter: Jurian Broertjes


I've ported the indexer-hbase for Nutch 2.x 
(https://github.com/apache/nutch/pull/184) to 1.x. Did some basic tests. Patch 
is attached.





[jira] [Updated] (NUTCH-2380) indexer-elastic version bump

2017-05-02 Thread Jurian Broertjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jurian Broertjes updated NUTCH-2380:

Attachment: NUTCH-2380-indexer-elastic-p0.patch

> indexer-elastic version bump
> 
>
> Key: NUTCH-2380
> URL: https://issues.apache.org/jira/browse/NUTCH-2380
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Minor
> Attachments: NUTCH-2380-indexer-elastic-p0.patch
>
>
> The current version of the indexer-elastic plugin is not compatible with ES 
> 5.x. The patch bumps the ES lib version to 5.3 but also requires a Nutch 
> classloader fix (NUTCH-2378) due to runtime dependency issues. 
> I didn't test compatibility with ES 2.x, so not sure if that still works.
> Please let me know what you think of the provided patch.





[jira] [Created] (NUTCH-2380) indexer-elastic version bump

2017-05-02 Thread Jurian Broertjes (JIRA)
Jurian Broertjes created NUTCH-2380:
---

 Summary: indexer-elastic version bump
 Key: NUTCH-2380
 URL: https://issues.apache.org/jira/browse/NUTCH-2380
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.13
Reporter: Jurian Broertjes
Priority: Minor


The current version of the indexer-elastic plugin is not compatible with ES 
5.x. The patch bumps the ES lib version to 5.3 but also requires a Nutch 
classloader fix (NUTCH-2378) due to runtime dependency issues. 

I didn't test compatibility with ES 2.x, so not sure if that still works.

Please let me know what you think of the provided patch.





[jira] [Updated] (NUTCH-2378) ChildFirst plugin classloader

2017-05-01 Thread Jurian Broertjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jurian Broertjes updated NUTCH-2378:

Attachment: NUTCH-2378-childfirst-plugin-classloader.patch

> ChildFirst plugin classloader
> -
>
> Key: NUTCH-2378
> URL: https://issues.apache.org/jira/browse/NUTCH-2378
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
> Attachments: NUTCH-2378-childfirst-plugin-classloader.patch
>
>
> While working on upgrading the indexer-elastic plugin from 2.x to 5.x, I ran 
> into several nasty runtime dependency issues (both local and on Hadoop). 
> After seeking help on the mailing list, I was still unable to resolve these 
> issues and after digging further, decided to try a different plugin 
> classloader strategy. 
> The normal classloader delegates class loading requests to its parent 
> classloader. This can cause all sorts of nasty runtime dependency version 
> conflicts (jar hell, version conflicts), since the plugin's own classloader 
> gets queried last. The child-first classloader approach tries to load a class 
> from the plugin's dependencies first and when unavailable, delegates to its 
> parent classloader. This fixed the issues I had.
> The new approach can give runtime LinkageErrors, but these are easily 
> resolvable (see the patch for a few examples).
> I've tested the new loader a bit and am curious about others' findings.





[jira] [Created] (NUTCH-2378) ChildFirst plugin classloader

2017-05-01 Thread Jurian Broertjes (JIRA)
Jurian Broertjes created NUTCH-2378:
---

 Summary: ChildFirst plugin classloader
 Key: NUTCH-2378
 URL: https://issues.apache.org/jira/browse/NUTCH-2378
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.13
Reporter: Jurian Broertjes


While working on upgrading the indexer-elastic plugin from 2.x to 5.x, I ran 
into several nasty runtime dependency issues (both local and on Hadoop). After 
seeking help on the mailing list, I was still unable to resolve these issues 
and after digging further, decided to try a different plugin classloader 
strategy. 

The normal classloader delegates class loading requests to its parent 
classloader. This can cause all sorts of nasty runtime dependency version 
conflicts (jar hell, version conflicts), since the plugin's own classloader 
gets queried last. The child-first classloader approach tries to load a class 
from the plugin's dependencies first and when unavailable, delegates to its 
parent classloader. This fixed the issues I had.
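
To illustrate the lookup order described above, here is a minimal sketch of a 
child-first loader. It is not the code from the attached patch: the class name 
ChildFirstClassLoader and the rule that "java." classes always go to the parent 
are assumptions made for this example.

```java
import java.net.URL;
import java.net.URLClassLoader;

/** Hypothetical child-first loader; not the class from the attached patch. */
class ChildFirstClassLoader extends URLClassLoader {

  ChildFirstClassLoader(URL[] pluginJars, ClassLoader parent) {
    super(pluginJars, parent);
  }

  @Override
  protected Class<?> loadClass(String name, boolean resolve)
      throws ClassNotFoundException {
    synchronized (getClassLoadingLock(name)) {
      // Reuse a class this loader has already defined.
      Class<?> c = findLoadedClass(name);
      if (c == null) {
        if (name.startsWith("java.")) {
          // Assumption: core JDK classes always come from the parent chain.
          c = super.loadClass(name, false);
        } else {
          try {
            // Child first: look in the plugin's own jars.
            c = findClass(name);
          } catch (ClassNotFoundException e) {
            // Not in the plugin: fall back to normal parent delegation.
            c = super.loadClass(name, false);
          }
        }
      }
      if (resolve) {
        resolveClass(c);
      }
      return c;
    }
  }
}
```

With this order, a class that exists both in a plugin jar and on the parent 
classpath is defined twice, once per loader. That is a typical source of the 
LinkageErrors mentioned below, and a common remedy is to keep such shared 
classes parent-first.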

The new approach can give runtime LinkageErrors, but these are easily 
resolvable (see the patch for a few examples).

I've tested the new loader a bit and am curious about others' findings.







[jira] [Commented] (NUTCH-2242) lastModified not always set

2016-05-11 Thread Jurian Broertjes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15279842#comment-15279842
 ] 

Jurian Broertjes commented on NUTCH-2242:
-

Hi Sebastian, I've put this in the reduce() function because that is where a 
generic modified/not-modified check is done. I think it would make sense to do 
setModifiedTime() there, together with setSignature().
The one in DefaultFetchSchedule is only for setting the modified time on the 
first successful fetch.
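
As a rough sketch of the behaviour I mean (not the patch itself): the Datum 
class and the update() helper below are hypothetical stand-ins for CrawlDatum 
and the check in reduce(), and the first-fetch case is folded into the same 
method only to keep the example short.

```java
import java.util.Arrays;

/** Hypothetical stand-in for CrawlDatum, reduced to the two relevant fields. */
class Datum {
  byte[] signature;   // null until the first successful fetch
  long modifiedTime;  // 0 until set
}

class ModifiedTimeSketch {

  /** Sets the modified time whenever the signature is first set or changes. */
  static void update(Datum stored, byte[] fetchedSignature, long fetchTime) {
    boolean firstFetch = stored.signature == null;
    boolean changed =
        !firstFetch && !Arrays.equals(stored.signature, fetchedSignature);
    if (firstFetch || changed) {
      // Keep both fields in step: a new signature implies a new modified time.
      stored.signature = fetchedSignature;
      stored.modifiedTime = fetchTime;
    }
  }
}
```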

> lastModified not always set
> ---
>
> Key: NUTCH-2242
> URL: https://issues.apache.org/jira/browse/NUTCH-2242
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.11
>Reporter: Jurian Broertjes
>Priority: Minor
> Fix For: 1.12
>
> Attachments: NUTCH-2242.patch
>
>
> I observed two issues:
> - When using the DefaultFetchSchedule, CrawlDatum's modifiedTime field is not 
> updated on the first successful fetch. 
> - When a document modification is detected (protocol- or signature-wise), the 
> modifiedTime isn't updated
> I can provide a patch later today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2242) lastModified not always set

2016-03-23 Thread Jurian Broertjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jurian Broertjes updated NUTCH-2242:

 Flags: Patch
Patch Info: Patch Available

> lastModified not always set
> ---
>
> Key: NUTCH-2242
> URL: https://issues.apache.org/jira/browse/NUTCH-2242
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.11
>Reporter: Jurian Broertjes
>Priority: Minor
> Attachments: NUTCH-2242.patch
>
>
> I observed two issues:
> - When using the DefaultFetchSchedule, CrawlDatum's modifiedTime field is not 
> updated on the first successful fetch. 
> - When a document modification is detected (protocol- or signature-wise), the 
> modifiedTime isn't updated
> I can provide a patch later today.





[jira] [Updated] (NUTCH-2242) lastModified not always set

2016-03-23 Thread Jurian Broertjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jurian Broertjes updated NUTCH-2242:

Attachment: NUTCH-2242.patch

Initial version of patch. Please review

> lastModified not always set
> ---
>
> Key: NUTCH-2242
> URL: https://issues.apache.org/jira/browse/NUTCH-2242
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.11
>Reporter: Jurian Broertjes
>Priority: Minor
> Attachments: NUTCH-2242.patch
>
>
> I observed two issues:
> - When using the DefaultFetchSchedule, CrawlDatum's modifiedTime field is not 
> updated on the first successful fetch. 
> - When a document modification is detected (protocol- or signature-wise), the 
> modifiedTime isn't updated
> I can provide a patch later today.





[jira] [Created] (NUTCH-2242) lastModified not always set

2016-03-23 Thread Jurian Broertjes (JIRA)
Jurian Broertjes created NUTCH-2242:
---

 Summary: lastModified not always set
 Key: NUTCH-2242
 URL: https://issues.apache.org/jira/browse/NUTCH-2242
 Project: Nutch
  Issue Type: Bug
  Components: crawldb
Affects Versions: 1.11
Reporter: Jurian Broertjes
Priority: Minor


I observed two issues:
- When using the DefaultFetchSchedule, CrawlDatum's modifiedTime field is not 
updated on the first successful fetch. 
- When a document modification is detected (protocol- or signature-wise), the 
modifiedTime isn't updated

I can provide a patch later today.





[jira] [Updated] (NUTCH-2203) Suffix URL filter can't handle trailing/leading whitespaces

2016-01-19 Thread Jurian Broertjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jurian Broertjes updated NUTCH-2203:

Attachment: NUTCH-2203.patch

Attached a patch to fix this.

> Suffix URL filter can't handle trailing/leading whitespaces
> ---
>
> Key: NUTCH-2203
> URL: https://issues.apache.org/jira/browse/NUTCH-2203
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.11
>Reporter: Jurian Broertjes
>Priority: Trivial
> Attachments: NUTCH-2203.patch
>
>
> I ran into an issue where some lines in suffix-urlfilter.txt contained 
> trailing whitespace, which caused the filtering to misbehave.





[jira] [Created] (NUTCH-2203) Suffix URL filter can't handle trailing/leading whitespaces

2016-01-19 Thread Jurian Broertjes (JIRA)
Jurian Broertjes created NUTCH-2203:
---

 Summary: Suffix URL filter can't handle trailing/leading 
whitespaces
 Key: NUTCH-2203
 URL: https://issues.apache.org/jira/browse/NUTCH-2203
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.11
Reporter: Jurian Broertjes
Priority: Trivial


I ran into an issue where some lines in suffix-urlfilter.txt contained trailing 
whitespace, which caused the filtering to misbehave.
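
A minimal sketch of the kind of fix I have in mind, assuming the rules are 
read line by line: trim each line before using it. The class SuffixRuleLines 
and its clean() method are hypothetical names for this example, not the 
plugin's actual code, and treating "#" lines as comments is an assumption here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not the suffix filter plugin's actual code. */
class SuffixRuleLines {

  /** Returns the rule lines with surrounding whitespace removed. */
  static List<String> clean(List<String> rawLines) {
    List<String> rules = new ArrayList<>();
    for (String raw : rawLines) {
      // Strip leading and trailing whitespace so ".pdf " matches like ".pdf".
      String line = raw.trim();
      // Assumption: blank lines and "#" comment lines carry no rule.
      if (line.isEmpty() || line.startsWith("#")) {
        continue;
      }
      rules.add(line);
    }
    return rules;
  }
}
```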





[jira] [Updated] (NUTCH-2197) Add solr5 solrcloud indexer support

2016-01-07 Thread Jurian Broertjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jurian Broertjes updated NUTCH-2197:

Attachment: NUTCH-2197.patch

I've attached a patch with initial support for Solr5 + SolrCloud. Please review 
it.

> Add solr5 solrcloud indexer support
> ---
>
> Key: NUTCH-2197
> URL: https://issues.apache.org/jira/browse/NUTCH-2197
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.12
>Reporter: Jurian Broertjes
>Priority: Minor
> Attachments: NUTCH-2197.patch
>
>
> Nutch cannot index to Solr5. Also proper SolrCloud support is missing.


