[jira] [Commented] (NUTCH-1513) Support Robots.txt for Ftp urls

2013-05-21 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662722#comment-13662722 ]

Hudson commented on NUTCH-1513:
---

Integrated in Nutch-trunk #2210 (See 
[https://builds.apache.org/job/Nutch-trunk/2210/])
NUTCH-1513 Support Robots.txt for Ftp urls (Revision 1484638)

 Result = SUCCESS
tejasp : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1484638
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
* /nutch/trunk/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java
* /nutch/trunk/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpRobotRulesParser.java


 Support Robots.txt for Ftp urls
 ---

 Key: NUTCH-1513
 URL: https://issues.apache.org/jira/browse/NUTCH-1513
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.7, 2.2
Reporter: Tejas Patil
Assignee: Tejas Patil
Priority: Minor
  Labels: robots.txt
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1513.2.x.v2.patch, NUTCH-1513.trunk.patch, 
 NUTCH-1513.trunk.v2.patch


 As per [0], an FTP site can have a robots.txt like [1]. In the Nutch code, the 
 Ftp plugin does not parse the robots file and accepts all URLs.
 In _src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java_:
 {noformat}
 public RobotRules getRobotRules(Text url, CrawlDatum datum) {
   return EmptyRobotRules.RULES;
 }
 {noformat}
 It's not clear if this was part of the design or if it's a bug.
 [0] : 
 https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
 [1] : ftp://example.com/robots.txt
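The stub above returns empty rules, so every FTP URL is accepted. As a hypothetical illustration of the kind of parsing a robots-aware FTP plugin needs (this is a standalone sketch, not Nutch's actual FtpRobotRulesParser, which reuses the HTTP robots machinery), a minimal Disallow-prefix matcher might look like:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: collect Disallow rules from a robots.txt fetched over
// FTP and check paths against them. Class and method names are illustrative.
class RobotsSketch {
    private final List<String> disallowed = new ArrayList<>();

    // Keep only the Disallow lines from records whose User-agent matches
    // the given agent name (or the wildcard "*").
    RobotsSketch(String robotsTxt, String agent) {
        boolean applies = false;
        for (String line : robotsTxt.split("\n")) {
            int hash = line.indexOf('#');                  // strip comments
            if (hash >= 0) line = line.substring(0, hash);
            line = line.trim();
            if (line.isEmpty()) continue;
            String lower = line.toLowerCase();
            if (lower.startsWith("user-agent:")) {
                String name = line.substring(11).trim().toLowerCase();
                applies = name.equals("*") || name.equals(agent.toLowerCase());
            } else if (applies && lower.startsWith("disallow:")) {
                String path = line.substring(9).trim();
                if (!path.isEmpty()) disallowed.add(path); // empty = allow all
            }
        }
    }

    // A path is disallowed if any recorded rule is a prefix of it.
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\n";
        RobotsSketch rules = new RobotsSketch(robots, "nutch");
        System.out.println(rules.isAllowed("/private/data.txt")); // false
        System.out.println(rules.isAllowed("/pub/file.txt"));     // true
    }
}
```

A real implementation would also fetch ftp://host/robots.txt once per host, cache the result, and fall back to allow-all on a 404-style miss, as the HTTP side does.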

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1513) Support Robots.txt for Ftp urls

2013-05-21 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662728#comment-13662728 ]

Hudson commented on NUTCH-1513:
---

Integrated in Nutch-nutchgora #613 (See 
[https://builds.apache.org/job/Nutch-nutchgora/613/])
NUTCH-1513 Support Robots.txt for Ftp urls (Revision 1484637)

 Result = SUCCESS
tejasp : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1484637
Files : 
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java
* /nutch/branches/2.x/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpRobotRulesParser.java




[jira] [Commented] (NUTCH-1513) Support Robots.txt for Ftp urls

2013-05-04 Thread Tejas Patil (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649026#comment-13649026 ]

Tejas Patil commented on NUTCH-1513:


One thing that I forgot to mention: the change picks up the agent names from 
http.agent.name and http.robots.agents. I could have added new configs like 
ftp.agent.name, but I don't see a point in doing that, because both sets of 
configs would generally carry the same values, so creating new ones would just 
add to the tangle of already existing configs. What say?
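For reference, those shared properties live in conf/nutch-site.xml; with this change the same values would also govern robots.txt handling for FTP URLs (the values below are example placeholders):

```xml
<!-- Shared agent configuration in conf/nutch-site.xml. After NUTCH-1513 these
     same properties also drive robots.txt handling for FTP URLs.
     The values are illustrative placeholders. -->
<property>
  <name>http.agent.name</name>
  <value>MyCrawler</value>
  <description>Agent name sent to servers and matched against robots.txt.</description>
</property>
<property>
  <name>http.robots.agents</name>
  <value>MyCrawler,*</value>
  <description>Comma-separated agent strings checked against robots.txt
  records, in decreasing order of precedence.</description>
</property>
```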



[jira] [Commented] (NUTCH-1513) Support Robots.txt for Ftp urls

2013-01-07 Thread Tejas Patil (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545691#comment-13545691 ]

Tejas Patil commented on NUTCH-1513:


Hi Lewis,
Thanks for your suggestion. I think that first migrating Http to 
crawler-commons ([NUTCH-1031|https://issues.apache.org/jira/browse/NUTCH-1031]) 
and then coming back to this one would be the better thing to do. I have done 
the changes for Http and attached the patch to the respective Jira.



[jira] [Commented] (NUTCH-1513) Support Robots.txt for Ftp urls

2013-01-06 Thread Lewis John McGibbney (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545512#comment-13545512 ]

Lewis John McGibbney commented on NUTCH-1513:
-

Hi Tejas, of course this is entirely down to you, but I think that using CC 
would be really neat. We plan (in time) on migrating over to CC, so now seems 
as good a time as any. As I said, though, this is entirely down to you.



[jira] [Commented] (NUTCH-1513) Support Robots.txt for Ftp urls

2013-01-04 Thread Markus Jelsma (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543718#comment-13543718 ]

Markus Jelsma commented on NUTCH-1513:
--

I don't know why Nutch doesn't have robots support for FTP, but it should. 
This feature should also be enabled at all times, with no way to disable it 
via configuration.



[jira] [Commented] (NUTCH-1513) Support Robots.txt for Ftp urls

2013-01-04 Thread Tejas Patil (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543720#comment-13543720 ]

Tejas Patil commented on NUTCH-1513:


To support this, I have 2 approaches:
# Implement robots handling for Ftp in a similar way to how it's been done for 
the Http protocol, with the parsing performed by Nutch.
# Same as #1, but use Crawler-Commons to do the parsing. 
[NUTCH-1031|https://issues.apache.org/jira/browse/NUTCH-1031] is already filed 
for its integration with Nutch. However, the [last 
release|http://code.google.com/p/crawler-commons/downloads/list] of 
Crawler-Commons was in July 2011, so it doesn't look to be under active 
development.

Please let me know your comments about these approaches.
