Tika boilerpipe extractors

2018-06-27 Thread Arora, Madhvi
Hi All, Note: I am reposting my question since it looks like the earlier one got posted on the wrong thread. We are using Nutch 1.13 and Solr 6. I am trying to use one of the parsers that come with Tika boilerpipe support. I am getting the best result for pages where there are only outlinks with CanolaExtracto

Tika boilerpipe extractors

2018-06-27 Thread Arora, Madhvi
Hi All, We are using Nutch 1.13 and Solr 6. I am trying to use one of the parsers that come with Tika boilerpipe support. I am getting the best result for pages where there are only outlinks with CanolaExtractor in a page like this: https://support.automationdirect.com/faq/dl205.php But checking

Nutch 1.x and Solr compatible versions

2017-05-02 Thread Arora, Madhvi
Hi, We currently use Nutch 1.10 and SOLR 4.x. We are in the process of upgrading both. I wanted to find out if the latest version of Nutch, 1.13, is compatible with SOLR 6. Also, is there any documentation that I can use for upgrading Nutch so that it will be compatible with SOLR 6? Thanks i

Re: Upgrade to Nutch 1.12

2016-08-19 Thread Arora, Madhvi
On Thu, Aug 18, 2016 at 7:08 AM, wrote: > >> >> From: "Arora, Madhvi" >> To: "user@nutch.apache.org" >> Cc: >> Date: Wed, 17 Aug 2016 13:30:09 + >> Subject: Upgrade to Nutch 1.12 >> Hi, >> >> >> I wanted to find out how t

Upgrade to Nutch 1.12

2016-08-17 Thread Arora, Madhvi
Hi, I wanted to find out how to correct the issue below and will appreciate any help. I am trying to upgrade to Nutch 1.12. I am using Solr 5.3.1. The reasons I am upgrading are: 1: https crawling 2: Boilerplate canola extraction through Tika The only problem I am having so far is an IOExcep

Re: Protocol change to https

2016-08-16 Thread Arora, Madhvi
ind of related to what I need. On 8/5/16, 2:18 PM, "Arora, Madhvi" wrote: >Thank you very much! > > > > >On 8/5/16, 2:13 PM, "Markus Jelsma" wrote: > >>I am not sure which version it was added in, you'd have to check CHANGES.txt, >>

Re: Protocol change to https

2016-08-05 Thread Arora, Madhvi
Thank you very much! On 8/5/16, 2:13 PM, "Markus Jelsma" wrote: >I am not sure which version it was added in, you'd have to check CHANGES.txt, but >upgrading is usually a good idea and very simple. >Markus > > > >-Original message- >> From:Arora, Madhvi >> Sent: Friday 5th August 201

Re: Protocol change to https

2016-08-05 Thread Arora, Madhvi
Markus, so to crawl https and http URLs successfully we just need to switch to a newer version of Nutch, i.e. higher than Nutch 1.10? On 8/5/16, 12:47 PM, "Markus Jelsma" wrote: >Hello - see inline. >Markus > >-Original message- >> From:Arora, Madhvi >> Sent: Friday 5th August 2016

Protocol change to https

2016-08-05 Thread Arora, Madhvi
Hi, We are using Nutch 1.10 and Solr 5. We have around 10 different web sites that are crawled regularly. We are changing the protocol of a few websites from http to https, so we will have a mixed bag of http and https protocols. I checked the Nutch user mail archive and understand that we need to change pr