[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558195#comment-13558195 ]

Julien Nioche commented on NUTCH-1031:
--------------------------------------

bq. 1. Continue to have the legacy code for parsing robots.txt files.
bq. 2. As an add-in, crawler-commons can be employed for the parsing. The user can pick based on a config parameter, with a note indicating that #2 won't work with multiple HTTP agents.

2 is overkill IMHO. The existing code works fine, and the point of moving to CC was to get rid of some of our code, not to make it bigger with yet another configuration option.

Lewis: donating our code is a good idea, but in the case of the robots parsing it's more about modifying the existing parser in CC. I haven't had time to look at the robots parsing in CC and am not familiar with it, but it would be a good thing to improve. In the meantime let's go for option 1. Thanks!
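
For context, a minimal sketch of what delegating to CC would look like, assuming the parseContent(url, content, contentType, robotName) signature from the early crawler-commons releases. The URL, robots.txt content, and agent name below are illustrative only; the single robot-name argument is also why multiple-agent support was raised as a concern above.

{code:java}
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsRulesDemo {
    public static void main(String[] args) throws Exception {
        // A toy robots.txt; real code would fetch http://host/robots.txt
        byte[] content = ("User-agent: *\n"
                        + "Disallow: /private/\n"
                        + "Crawl-delay: 5\n").getBytes("UTF-8");

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        // url, raw content, content type, and the agent name to match
        BaseRobotRules rules = parser.parseContent(
                "http://example.com/robots.txt", content, "text/plain", "nutch");

        System.out.println(rules.isAllowed("http://example.com/private/x.html")); // false
        System.out.println(rules.isAllowed("http://example.com/public/x.html"));  // true
        System.out.println(rules.getCrawlDelay()); // crawl delay, in milliseconds
    }
}
{code}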

                
> Delegate parsing of robots.txt to crawler-commons
> -------------------------------------------------
>
>                 Key: NUTCH-1031
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1031
>             Project: Nutch
>          Issue Type: Task
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>            Priority: Minor
>              Labels: robots.txt
>             Fix For: 1.7
>
>         Attachments: NUTCH-1031.v1.patch
>
>
> We're about to release the first version of Crawler-Commons 
> [http://code.google.com/p/crawler-commons/] which contains a parser for 
> robots.txt files. This parser should also be better than the one we currently 
> have in Nutch. I will delegate this functionality to CC as soon as it is
> publicly available.

