RE: Nutch is taking very long time to complete crawl job :Nutch 2.3.1 + hadoop 2.7.1 + Yarn

2016-08-05 Thread Markus Jelsma
Hello Shubham, If you have set fetcher.parse to true, then do not execute the parse job, because the content is already parsed. I don't know about 2.x, but in 1.x, you cannot parse a segment which was already parsed. If you use some crawl script that has parsing hardcoded in it, despite fetcher.parse,

Re: Protocol change to https

2016-08-05 Thread Arora, Madhvi
Thank you very much! On 8/5/16, 2:13 PM, "Markus Jelsma" wrote: >I am not sure in which version it was added, you'd have to check CHANGES.txt, but >upgrading is usually a good idea and very simple. >Markus > > > >-Original message- >> From:Arora, Madhvi

RE: Protocol change to https

2016-08-05 Thread Markus Jelsma
I am not sure in which version it was added, you'd have to check CHANGES.txt, but upgrading is usually a good idea and very simple. Markus -Original message- > From:Arora, Madhvi > Sent: Friday 5th August 2016 19:53 > To: user@nutch.apache.org > Subject:

Re: Protocol change to https

2016-08-05 Thread Arora, Madhvi
Markus, so to crawl https and http URLs successfully we just need to switch to a newer version of Nutch, i.e. higher than Nutch 1.10? On 8/5/16, 12:47 PM, "Markus Jelsma" wrote: >Hello - see inline. >Markus > >-Original message- >> From:Arora, Madhvi

RE: Protocol change to https

2016-08-05 Thread Markus Jelsma
Hello - see inline. Markus -Original message- > From:Arora, Madhvi > Sent: Friday 5th August 2016 18:03 > To: user@nutch.apache.org > Subject: Protocol change to https > > Hi, > > We are using Nutch 1.10 and Solr 5. We have around 10 different web sites

Protocol change to https

2016-08-05 Thread Arora, Madhvi
Hi, We are using Nutch 1.10 and Solr 5. We have around 10 different web sites that are crawled regularly. We are changing the protocol of a few websites from http to https, so we will have a mixed bag of http and https protocols. I checked the Nutch user mailing list archive and gathered that we need to change

schema version (UNCLASSIFIED)

2016-08-05 Thread Musshorn, Kris T CTR USARMY RDECOM ARL (US)
CLASSIFICATION: UNCLASSIFIED Is there a particular schema.xml file I should be using with Nutch 1.12 to index into Solr 6.1.0? I'm trying to debug an indexing error: Indexer: java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836) at