[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil updated NUTCH-1031:
-------------------------------

    Attachment: NUTCH-1031-2.x.v1.patch

Patch for 2.x. If there are no objections, I will commit it in the coming days.

Delegate parsing of robots.txt to crawler-commons
-------------------------------------------------

                Key: NUTCH-1031
                URL: https://issues.apache.org/jira/browse/NUTCH-1031
            Project: Nutch
         Issue Type: Task
           Reporter: Julien Nioche
           Assignee: Tejas Patil
           Priority: Minor
             Labels: robots.txt
            Fix For: 1.7
        Attachments: CC.robots.multiple.agents.patch, CC.robots.multiple.agents.v2.patch, NUTCH-1031-2.x.v1.patch, NUTCH-1031-trunk.v2.patch, NUTCH-1031-trunk.v3.patch, NUTCH-1031-trunk.v4.patch, NUTCH-1031-trunk.v5.patch, NUTCH-1031.v1.patch

We're about to release the first version of Crawler-Commons [http://code.google.com/p/crawler-commons/] which contains a parser for robots.txt files. This parser should also be better than the one we currently have in Nutch. I will delegate this functionality to CC as soon as it is available publicly.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil updated NUTCH-1031:
-------------------------------

    Attachment: NUTCH-1031-trunk.v5.patch

Thanks Lewis :) I have corrected the usage message.
[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil updated NUTCH-1031:
-------------------------------

    Attachment: NUTCH-1031-trunk.v4.patch

Hey Lewis, thanks for pointing that out :) I have updated the patch.
[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil updated NUTCH-1031:
-------------------------------

    Attachment: CC.robots.multiple.agents.v2.patch

Hi Ken, I have added a test case to CC for the change (CC.robots.multiple.agents.v2.patch).
[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil updated NUTCH-1031:
-------------------------------

    Attachment: NUTCH-1031-trunk.v2.patch

Added a patch for Nutch trunk (NUTCH-1031-trunk.v2.patch). If nobody objects, I will work on the corresponding patch for 2.x. Summary of the changes:
- Removed the RobotRules class, as CC provides a replacement: BaseRobotRules.
- Moved RobotRulesParser out of the http plugin on account of NUTCH-1513, since other protocols might share it.
- Added HttpRobotRulesParser, which is responsible for fetching the robots file over the http protocol.
- Changed references from the old Nutch classes to the classes from CC.
[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil updated NUTCH-1031:
-------------------------------

    Attachment: CC.robots.multiple.agents.patch

I looked at the source code of CC to understand how it works, and I have identified the change needed in CC so that it supports multiple user agents. While testing it, I found that there is a semantic difference between how CC works and how the legacy Nutch parser works.

*What CC does:* It splits _http.robots.agents_ on commas (the change I made locally). It scans the robots file line by line, each time checking whether the current User-Agent in the file matches any entry from _http.robots.agents_. If a match is found, it takes all the corresponding rules for that agent and stops further parsing.

{noformat}
robots file:
User-Agent: Agent1 #foo
Disallow: /a

User-Agent: Agent2 Agent3
Disallow: /d

http.robots.agents: Agent2,Agent1
Path: /a
{noformat}

For the example above, as soon as the first line of the robots file is scanned, a match for Agent1 is found. CC takes all the corresponding rules for that agent and stores only this information:

{noformat}
User-Agent: Agent1
Disallow: /a
{noformat}

Everything else is ignored.

*What the Nutch robots parser does:* It splits _http.robots.agents_ on commas, scans ALL the lines of the robots file, and evaluates the matches according to the precedence of the user agents. For the example above, the rules for both Agent2 and Agent1 have a match in the robots file, but since Agent2 comes first in _http.robots.agents_, it is given priority and the stored rules will be:

{noformat}
User-Agent: Agent2
Disallow: /d
{noformat}

If we want to drop the precedence-based behavior and adopt the CC model, I have a small patch for crawler-commons (CC.robots.multiple.agents.patch).
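The two matching strategies described above can be sketched side by side. This is an illustrative, self-contained model of the matching logic only, not the actual CC or Nutch code; the class name, the pre-parsed rule map, and both method names are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AgentMatchDemo {
    // A robots.txt already parsed into (agent -> disallowed paths),
    // kept in file order via LinkedHashMap. Models the example above,
    // with the "Agent2 Agent3" block reduced to Agent2 for brevity.
    static final Map<String, List<String>> RULES = new LinkedHashMap<>();
    static {
        RULES.put("Agent1", List.of("/a"));
        RULES.put("Agent2", List.of("/d"));
    }

    // CC-style: walk the file top-down and stop at the first block whose
    // agent appears anywhere in the configured http.robots.agents list.
    static List<String> firstMatch(List<String> configuredAgents) {
        for (Map.Entry<String, List<String>> block : RULES.entrySet()) {
            if (configuredAgents.contains(block.getKey())) {
                return block.getValue();
            }
        }
        return List.of();
    }

    // Legacy-Nutch-style: consider the whole file, then pick the block for
    // the matching agent that comes earliest in http.robots.agents.
    static List<String> precedenceMatch(List<String> configuredAgents) {
        for (String agent : configuredAgents) { // precedence order
            if (RULES.containsKey(agent)) {
                return RULES.get(agent);
            }
        }
        return List.of();
    }

    public static void main(String[] args) {
        // http.robots.agents: Agent2,Agent1
        List<String> agents = List.of("Agent2", "Agent1");
        System.out.println(firstMatch(agents));      // [/a] - Agent1's block is first in the file
        System.out.println(precedenceMatch(agents)); // [/d] - Agent2 has higher precedence
    }
}
```

With the same configuration the two strategies disagree, which is exactly the semantic difference at issue: first-match honors file order, precedence-match honors the order of _http.robots.agents_.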
[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1031:
---------------------------------

    Fix Version/s:     (was: 1.5)
                   1.6

20120304-push-1.6