Re: HTTP Post Authentication

2015-03-12 Thread Tizy Ninan
Hi Lewis, Thank you for the reply. I tried providing the parameters specified in the httpclient-auth.xml template file, but while crawling I am getting the following warnings. WARN httpclient.Http: Bad auth conf file: root element found in httpclient-auth.xml - must be WARN httpclient.Http:

Re: Nutch 1.9 and Hadoop 1.2.1 Domains Crawl Depth

2015-03-12 Thread Svyatoslav Lavryk
Thank you very much Nirav, it helped. On Wed, Mar 11, 2015 at 7:20 PM, Nirav Thaker wrote: > You will need to put '_maxdepth_' metadata in seed file like following: > > http://domain1.com/abc _maxdepth_=2 some.other.metadata=xys > > > http://domain2.com/xyz _maxdepth_=99 some.other.metadata=abc

Re: Nutch 2.3 Build Error, Please help

2015-03-12 Thread Lewis John Mcgibbney
Hi Arthur, On Thu, Mar 12, 2015 at 12:20 AM, wrote: > > I downloaded http://svn.apache.org/repos/asf/nutch/branches/2.x/ > re-run the compilation, still got the error > > Question: Are the following dependencies correctly set in my ivy.xml? > > conf="*->default" /> > rev="2.2.3

Re: HTTP Post Authentication

2015-03-12 Thread Lewis John Mcgibbney
Hi Tizy, On Thu, Mar 12, 2015 at 12:20 AM, wrote: > > Is there any detailed step by step explanation on how to implement > HTTPPostAuthentication on Nutch 1.10? > > https://github.com/apache/nutch/blob/trunk/conf/httpclient-auth.xml.template#L61-L105 https://wiki.apache.org/nutch/HttpPostAuthen

RE: Handling servers with wrong Last Modified HTTP header

2015-03-12 Thread Markus Jelsma
Hello Jorge, This is an interesting but very complicated issue. First of all, do not rely on HTTP headers; they are incorrect on any scale larger than very small. This is true for Last-Modified due to dynamic CMSs, but also for many other headers. You can even expect website descriptions in headers s

Re: Nutch 1.9 and Hadoop 1.2.1 Domains Crawl Depth

2015-03-12 Thread Nirav Thaker
You will need to put '_maxdepth_' metadata in seed file like following: http://domain1.com/abc _maxdepth_=2 some.other.metadata=xys http://domain2.com/xyz _maxdepth_=99 some.other.metadata=abc HTH On 03/11/2015 01:10 PM, Svyatoslav Lavryk wrote: Hello, We use Nutch 1.9 with Hadoop 1.2.1 for
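A minimal sketch of the seed list described above, assuming the usual Nutch seed-file layout of one URL per line followed by tab-separated name=value metadata. The helper name `seed_line` is hypothetical and only illustrates the layout; the URLs and metadata values are the ones from the message.

```python
# Sketch only: build a Nutch seed list where each URL carries its own metadata.
# Assumption: fields on a line are tab-separated (URL first, then name=value pairs).

def seed_line(url, metadata):
    """Return one seed-file line: the URL followed by tab-separated name=value pairs."""
    fields = [url] + [f"{name}={value}" for name, value in metadata.items()]
    return "\t".join(fields)

seeds = [
    seed_line("http://domain1.com/abc", {"_maxdepth_": 2, "some.other.metadata": "xys"}),
    seed_line("http://domain2.com/xyz", {"_maxdepth_": 99, "some.other.metadata": "abc"}),
]

# One line per URL; write this text to the seed file passed to the inject step.
seed_text = "\n".join(seeds) + "\n"
print(seed_text, end="")
```

Each domain gets its own `_maxdepth_` value, so the depth limit can differ per seed URL rather than being set once for the whole crawl.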

RE: Nutch documents have huge scores in Solr

2015-03-12 Thread Markus Jelsma
Hello Jigal - every distribution of Nutch configuration should in my opinion disable OPIC-scoring. In fact, I think we should remove it from nutch-default.xml altogether. Markus -Original message- > From:Jigal van Hemert | alterNET internet BV > Sent: Wednesday 11th March 2015 9:40 > To

Re: Crawling Pages from Single Domain

2015-03-12 Thread Siddharth Shah
Hi Jonathan, Apologies for my delayed response. Thank you for the pointer; the crawl worked as expected once I tweaked the regex filtering. Thank you once again, Sidharth On Wed, Mar 11, 2015 at 4:46 AM, Jonathan Cooper-Ellis < jcooperel...@cloudera.com> wrote: > Hi Siddharth,

HTTP Post Authentication

2015-03-12 Thread Tizy Ninan
Hi, Is there any detailed step-by-step explanation of how to implement HTTPPostAuthentication on Nutch 1.10? Thanks and Regards, Tizy