Stop Local Job Threads

2017-06-12 Thread Ben Vachon
Hi all, I'm running Nutch 2.3.1 over a standalone HBase instance with no YARN or HDFS. This means that the jobs get run through org.apache.hadoop.mapred.LocalJobRunner, which doesn't support killing mapred tasks. I've set it up so that all of the Nutch threads get run in the same ThreadGroup a

rel="canonical" attribute

2017-05-18 Thread Ben Vachon
Hi all, I'm wondering how Nutch 2.3.1 handles links with the rel="canonical" attribute. I found this ticket: https://issues.apache.org/jira/browse/NUTCH-710 which is from version 1.1 and doesn't seem to have ever been resolved. Are all canonical links still just rejected? Are there any plans

Re: delete STATUS_GONE pages from index

2017-05-16 Thread Ben Vachon
you need to set db.update.purge.404=true? Tom On 15/05/17 20:35, Ben Vachon wrote: Hi all, I'm working with Nutch 2.3.1 and I have a problem that I'm hoping the community can help me with. A page is fetched successfully and subsequently indexed during the initial run of a crawle

delete STATUS_GONE pages from index

2017-05-15 Thread Ben Vachon
Hi all, I'm working with Nutch 2.3.1 and I have a problem that I'm hoping the community can help me with. A page is fetched successfully and subsequently indexed during the initial run of a crawler, but later, the page no longer exists on the server (404 not found). When I run the crawler agai

ConnectionLoss with hbase 1.1.2

2017-04-19 Thread Ben Vachon
Hi all, It's a requirement for our platform to use the hbase-client-1.1.2 jar and we can't have multiple versions of hbase-client so I need to get nutch-2.3.1 to use hbase-client-1.1.2 rather than 0.98.8-hadoop2. For these tests, I have been pointing Nutch at a standalone HBase running on

Re: Nutch 1.13 @Sierra - Java -D parameters not passed to nutch

2017-04-11 Thread Ben Vachon
Hi Fabio, I believe there is a property generate.max.distance in nutch-site.xml in the newest releases that you can use to configure max depth. On 04/11/2017 06:20 AM, Fabio Ricci wrote: Hi Sebastian thank you for your message. That does not help me really… Yes I knew the output of ./crawl

Re: Nutch Plugins Source Control

2017-04-07 Thread Ben Vachon
<https://issues.apache.org/jira/browse/NUTCH-2292> HTH Julien Thanks very much, Ben V. On 04/07/2017 09:48 AM, lsroudi abdel wrote: hi, i think you should add it in the ivy/ivy.xml and just run ant runtime On Thu, Apr 6, 2017 at 9:35 PM, Ben Vachon wrote: Hi all, I'

Re: Nutch Plugins Source Control

2017-04-07 Thread Ben Vachon
n V. On 04/07/2017 09:48 AM, lsroudi abdel wrote: hi, i think you should add it in the ivy/ivy.xml and just run ant runtime On Thu, Apr 6, 2017 at 9:35 PM, Ben Vachon wrote: Hi all, I'm working on a project that gets Nutch 2.3.1 from maven and uses it to set off crawl jobs which are con

Nutch Plugins Source Control

2017-04-06 Thread Ben Vachon
Hi all, I'm working on a project that gets Nutch 2.3.1 from maven and uses it to set off crawl jobs which are configurable in our own UI and through our own search platform's properties. To allow specific configuration of crawlers, I want to use many of the default plugins that come with a Nutch