[jira] [Commented] (NUTCH-1513) Support Robots.txt for Ftp urls
[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662722#comment-13662722 ]

Hudson commented on NUTCH-1513:
-------------------------------

Integrated in Nutch-trunk #2210 (See [https://builds.apache.org/job/Nutch-trunk/2210/])
NUTCH-1513 Support Robots.txt for Ftp urls (Revision 1484638)

Result = SUCCESS
tejasp : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1484638
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
* /nutch/trunk/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java
* /nutch/trunk/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpRobotRulesParser.java

> Support Robots.txt for Ftp urls
> -------------------------------
>
>                 Key: NUTCH-1513
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1513
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.7, 2.2
>            Reporter: Tejas Patil
>            Assignee: Tejas Patil
>            Priority: Minor
>              Labels: robots.txt
>             Fix For: 2.3, 1.8
>         Attachments: NUTCH-1513.2.x.v2.patch, NUTCH-1513.trunk.patch, NUTCH-1513.trunk.v2.patch
>
> As per [0], an FTP site can have a robots.txt like [1]. In the Nutch code, the Ftp plugin does not parse the robots file and accepts all URLs. In _src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java_:
> {noformat}
> public RobotRules getRobotRules(Text url, CrawlDatum datum) {
>   return EmptyRobotRules.RULES;
> }
> {noformat}
> It is not clear whether this was part of the design or a bug.
> [0] : https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
> [1] : ftp://example.com/robots.txt

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1513) Support Robots.txt for Ftp urls
[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662728#comment-13662728 ]

Hudson commented on NUTCH-1513:
-------------------------------

Integrated in Nutch-nutchgora #613 (See [https://builds.apache.org/job/Nutch-nutchgora/613/])
NUTCH-1513 Support Robots.txt for Ftp urls (Revision 1484637)

Result = SUCCESS
tejasp : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1484637
Files :
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java
* /nutch/branches/2.x/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpRobotRulesParser.java
[jira] [Commented] (NUTCH-1513) Support Robots.txt for Ftp urls
[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649026#comment-13649026 ]

Tejas Patil commented on NUTCH-1513:
------------------------------------

One thing that I forgot to mention: the change picks up the agent names from http.agent.name and http.robots.agents. I could have added new configs such as ftp.agent.name, but I don't see the point in doing that, because both configs would generally carry the same values, so creating new ones would just add to the nest of already existing configs. What say?
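The fallback described in the comment above can be sketched as follows. This is an illustrative sketch only: the property keys http.agent.name and http.robots.agents are the real Nutch config names mentioned in the comment, but the helper class, its method, and the use of plain java.util.Properties (instead of Nutch's Hadoop Configuration) are hypothetical simplifications.

```java
import java.util.Properties;

/**
 * Hypothetical helper illustrating how an FTP robots parser could reuse
 * the existing HTTP agent properties rather than introduce ftp.agent.*
 * keys that would normally carry the same values.
 */
public class AgentNameLookup {

    /** Return the agent names the robots parser should match against. */
    public static String robotsAgents(Properties conf) {
        // http.robots.agents may list several agent names; if it is not
        // set, fall back to the single crawler identity in http.agent.name.
        String agents = conf.getProperty("http.robots.agents");
        if (agents == null || agents.trim().isEmpty()) {
            agents = conf.getProperty("http.agent.name", "");
        }
        return agents.trim();
    }
}
```

With this shape, an FTP robots parser and the HTTP one would always agree on the crawler identity, which is the point Tejas makes about avoiding duplicate config keys.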
[jira] [Commented] (NUTCH-1513) Support Robots.txt for Ftp urls
[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545691#comment-13545691 ]

Tejas Patil commented on NUTCH-1513:
------------------------------------

Hi Lewis,
Thanks for your suggestion. I think that first migrating Http to crawler-commons ([NUTCH-1031|https://issues.apache.org/jira/browse/NUTCH-1031]) and then coming back to this one would be the better thing to do. I have made the changes for Http and attached the patch to the respective Jira.
[jira] [Commented] (NUTCH-1513) Support Robots.txt for Ftp urls
[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545512#comment-13545512 ]

Lewis John McGibbney commented on NUTCH-1513:
---------------------------------------------

Hi Tejas, of course this is entirely down to you, but I think that using CC would be really neat. We plan (in time) on migrating over to CC, so now seems as good a time as any. As I said, however, this is entirely down to you.
[jira] [Commented] (NUTCH-1513) Support Robots.txt for Ftp urls
[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543718#comment-13543718 ]

Markus Jelsma commented on NUTCH-1513:
--------------------------------------

I don't know why Nutch doesn't have robots support for FTP, but it should. This feature should also be enabled at all times, without any possibility of disabling it via configuration.
[jira] [Commented] (NUTCH-1513) Support Robots.txt for Ftp urls
[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543720#comment-13543720 ]

Tejas Patil commented on NUTCH-1513:
------------------------------------

To support this, I see two approaches:
# Implement robots handling for Ftp in a similar way to what is done for the Http protocol, with the parsing performed by Nutch itself.
# Same as #1, but use Crawler-Commons to do the parsing. There is already [NUTCH-1031|https://issues.apache.org/jira/browse/NUTCH-1031] filed for its integration with Nutch. The [last release|http://code.google.com/p/crawler-commons/downloads/list] of Crawler-Commons was in July 2011, so it does not appear to be under active development.

Please let me know your comments on these approaches.
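The kind of in-Nutch parsing that approach #1 implies can be sketched in miniature. This is a sketch only, assuming nothing about Nutch's actual RobotRules classes: the class name, the single-group User-agent handling, and the prefix-match Disallow semantics are deliberate simplifications (a real parser must also handle Allow lines, multiple User-agent lines per group, and blank-line group boundaries).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/**
 * Hypothetical minimal robots.txt parser of the kind approach #1 would
 * need; not the Nutch or Crawler-Commons API.
 */
public class SimpleFtpRobotRules {

    /** Disallow path prefixes collected for the matched agent. */
    private final List<String> disallowed = new ArrayList<>();

    /** Parse robots.txt content, keeping rules for the given agent (or "*"). */
    public static SimpleFtpRobotRules parse(String content, String agent) {
        SimpleFtpRobotRules rules = new SimpleFtpRobotRules();
        boolean inMatchingGroup = false;
        for (String line : content.split("\n")) {
            // strip trailing comments and surrounding whitespace
            int hash = line.indexOf('#');
            if (hash >= 0) line = line.substring(0, hash);
            line = line.trim();
            if (line.isEmpty()) continue;
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // match the wildcard group or any group naming our agent
                inMatchingGroup = value.equals("*")
                    || value.toLowerCase(Locale.ROOT).contains(agent.toLowerCase(Locale.ROOT));
            } else if (inMatchingGroup && field.equals("disallow") && !value.isEmpty()) {
                rules.disallowed.add(value);
            }
        }
        return rules;
    }

    /** A path is allowed unless it starts with a disallowed prefix. */
    public boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```

Approach #2 would delegate exactly this parsing step to Crawler-Commons instead, which is why the two options differ mainly in where the parsing logic is maintained.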