Re: nutch elpais.com

2014-06-16 Thread Julien Nioche
Salut Yann, Not really answering your question, but where did you get this config from? Some of its elements have long been deprecated (query-*, response-*, summary-*). Julien On 15 June 2014 10:20, Yann Levreau yann.levr...@gmail.com wrote: Hi everyone! I'm sorry to disturb you, but I need

[jira] [Created] (NUTCH-1793) HttpRobotRulesParser not configured properly = http.robots.403.allow property is not read

2014-06-16 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-1793: Summary: HttpRobotRulesParser not configured properly = http.robots.403.allow property is not read Key: NUTCH-1793 URL: https://issues.apache.org/jira/browse/NUTCH-1793

Re: nutch elpais.com

2014-06-16 Thread Yann Levreau
You're right, I need to clean these config files. I think these plugins came from Nutch 1.7 (bad copy/paste :) ). I have news on my issue. There were actually two issues: 1) outlinks are not set in the WebPage: in ParseUtil.java (line 195), we have: *if

RE: nutch elpais.com

2014-06-16 Thread Markus Jelsma
Hi - sites such as nytimes are hard to crawl. The only way to work around the redirect problem is to identify why the site redirects and then have Nutch send the appropriate HTTP headers so it won't. It may be a cookie, or a browser-like user-agent string. AFAIK Nutch has no facility yet to send
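
One way to find out which header the site reacts to is to probe it outside Nutch first. The following is only a minimal sketch using the JDK's own HttpURLConnection, not anything from Nutch itself: the class name RedirectProbe, the helper browserLikeHeaders, and the sample user-agent and cookie values are all hypothetical and chosen for illustration. It sends a single GET with redirects disabled and reports the status and Location header, so a request with and without the extra headers can be compared.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical diagnostic helper, independent of Nutch.
final class RedirectProbe {

    // Builds a small set of browser-like request headers.
    // The cookie is optional; pass null or "" to leave it out.
    static Map<String, String> browserLikeHeaders(String userAgent, String cookie) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", userAgent);
        headers.put("Accept", "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8");
        if (cookie != null && !cookie.isEmpty()) {
            headers.put("Cookie", cookie);
        }
        return headers;
    }

    // Sends one GET without following redirects and returns
    // "<status> -> <Location header or (no redirect)>".
    static String probe(String url, Map<String, String> headers) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setInstanceFollowRedirects(false);
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        for (Map.Entry<String, String> h : headers.entrySet()) {
            conn.setRequestProperty(h.getKey(), h.getValue());
        }
        try {
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            return status + " -> " + (location == null ? "(no redirect)" : location);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Sample values only; replace with whatever a real browser sends.
        Map<String, String> headers =
                browserLikeHeaders("Mozilla/5.0 (X11; Linux x86_64)", "session=example");
        if (args.length == 0) {
            // No URL given: just show the headers that would be sent.
            headers.forEach((k, v) -> System.out.println(k + ": " + v));
            return;
        }
        System.out.println("bare request:    " + probe(args[0], new LinkedHashMap<>()));
        System.out.println("browser headers: " + probe(args[0], headers));
    }
}
```

Run it with the problem URL as the only argument. If the bare request is redirected and the second one is not, one of the added headers is the likely trigger, and removing them one at a time narrows it down.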

Build failed in Jenkins: Nutch-nutchgora #1045

2014-06-16 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-nutchgora/1045/ -- [...truncated 3068 lines...] init-plugin: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-nutchgora/ws/2.x/ivy/ivysettings.xml