[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13652805#comment-13652805 ]

Tejas Patil commented on NUTCH-1031:

I had forgotten to add the crawler-commons dependency in pom.xml. Just committed that to trunk (rev 1480551) and 2.x (rev 1480551).

Delegate parsing of robots.txt to crawler-commons

Key: NUTCH-1031
URL: https://issues.apache.org/jira/browse/NUTCH-1031
Project: Nutch
Issue Type: Task
Reporter: Julien Nioche
Assignee: Tejas Patil
Priority: Minor
Labels: robots.txt
Fix For: 1.7, 2.2
Attachments: CC.robots.multiple.agents.patch, CC.robots.multiple.agents.v2.patch, NUTCH-1031-2.x.v1.patch, NUTCH-1031-trunk.v2.patch, NUTCH-1031-trunk.v3.patch, NUTCH-1031-trunk.v4.patch, NUTCH-1031-trunk.v5.patch, NUTCH-1031.v1.patch

We're about to release the first version of Crawler-Commons [http://code.google.com/p/crawler-commons/] which contains a parser for robots.txt files. This parser should also be better than the one we currently have in Nutch. I will delegate this functionality to CC as soon as it is available publicly.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
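For readers following along, the dependency addition discussed above would look roughly like the snippet below in Nutch's ivy.xml. This is an illustrative sketch, not the committed patch; the Maven coordinates and the 0.2 version are assumptions based on the release mentioned later in this thread.

```xml
<!-- Sketch only: coordinates and rev are assumptions, not copied from the
     actual NUTCH-1031 commit. -->
<dependency org="com.google.code.crawler-commons"
            name="crawler-commons"
            rev="0.2"
            conf="*->default"/>
```

Since Nutch generates pom.xml from ivy.xml at release time (as noted in the next comment), a matching `<dependency>` entry would also be needed in pom.xml for development builds.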
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13653012#comment-13653012 ]

Lewis John McGibbney commented on NUTCH-1031:

Hi Tejas. A quick note on keeping pom.xml up-to-date: whenever we do a release, pom.xml is brought fully up-to-date based upon the contents and configuration of ivy.xml. What this means is that every tagged branch of Nutch has a completely accurate pom.xml, while the current development branches do not. I will make sure to update pom.xml in the forthcoming releases. Regardless, thank you for the attention to detail here.
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644754#comment-13644754 ]

Lewis John McGibbney commented on NUTCH-1031:

+1 from me Tejas. Unit tests all pass fine, and some tests I did locally were good as well. CLI looks good. Documentation in the patch is really nice.
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644868#comment-13644868 ]

Hudson commented on NUTCH-1031:

Integrated in Nutch-nutchgora #587 (See [https://builds.apache.org/job/Nutch-nutchgora/587/])
NUTCH-1031 Delegate parsing of robots.txt to crawler-commons (Revision 1477319)
Result = FAILURE
tejasp : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1477319
Files:
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/ivy/ivy.xml
* /nutch/branches/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java
* /nutch/branches/2.x/src/java/org/apache/nutch/protocol/EmptyRobotRules.java
* /nutch/branches/2.x/src/java/org/apache/nutch/protocol/Protocol.java
* /nutch/branches/2.x/src/java/org/apache/nutch/protocol/RobotRulesParser.java
* /nutch/branches/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
* /nutch/branches/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
* /nutch/branches/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java
* /nutch/branches/2.x/src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java
* /nutch/branches/2.x/src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java
* /nutch/branches/2.x/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java
* /nutch/branches/2.x/src/plugin/protocol-sftp/src/java/org/apache/nutch/protocol/sftp/Sftp.java
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13624194#comment-13624194 ]

Tejas Patil commented on NUTCH-1031:

I have removed the @author tag and ported the checks from 2.x into the patch, as per the suggestion from [~wastl-nagel]. Will commit the changes shortly to trunk and start work on porting these changes to 2.x.
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13624291#comment-13624291 ]

Hudson commented on NUTCH-1031:

Integrated in Nutch-trunk #2156 (See [https://builds.apache.org/job/Nutch-trunk/2156/])
NUTCH-1031 Delegate parsing of robots.txt to crawler-commons (Revision 1465159)
Result = SUCCESS
tejasp : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1465159
Files:
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
* /nutch/trunk/src/java/org/apache/nutch/protocol/EmptyRobotRules.java
* /nutch/trunk/src/java/org/apache/nutch/protocol/Protocol.java
* /nutch/trunk/src/java/org/apache/nutch/protocol/RobotRulesParser.java
* /nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
* /nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
* /nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java
* /nutch/trunk/src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java
* /nutch/trunk/src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java
* /nutch/trunk/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603406#comment-13603406 ]

Sebastian Nagel commented on NUTCH-1031:

+1 (nothing to complain about)

P.S.: see [~jnioche]'s comment in NUTCH-1541 about the {{@author}} tag (a formality you couldn't know about).
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603482#comment-13603482 ]

Sebastian Nagel commented on NUTCH-1031:

There are differences between trunk and 2.x:
* in org.apache.nutch.protocol.http.api.RobotRulesParser (lib-http), 2.x does additional plausibility checks for the properties {{http.agent.name}} and {{http.robots.agents}}

Maybe that's worth taking into trunk as well, also with respect to porting this issue to 2.x :)
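To make the point concrete, the kind of plausibility check described above could be sketched as follows. This is a hypothetical illustration with made-up method names, not the actual Nutch 2.x code: it verifies that `http.agent.name` is set and appears in the comma-separated `http.robots.agents` list, prepending it if absent.

```java
// Hypothetical sketch of plausibility checks on http.agent.name and
// http.robots.agents; names and behavior are illustrative, not Nutch 2.x code.
public class AgentConfigCheck {

    /**
     * Validate the agent configuration and return the robots-agents list
     * to use for robots.txt matching.
     */
    static String checkRobotsAgents(String agentName, String robotsAgents) {
        if (agentName == null || agentName.trim().isEmpty()) {
            throw new IllegalArgumentException("http.agent.name property must be set");
        }
        agentName = agentName.trim();
        if (robotsAgents == null || robotsAgents.trim().isEmpty()) {
            // no explicit robots agents configured: fall back to the agent name
            return agentName;
        }
        for (String agent : robotsAgents.split(",")) {
            if (agent.trim().equalsIgnoreCase(agentName)) {
                // agent name is present in the configured list: use it as-is
                return robotsAgents;
            }
        }
        // agent name missing from the list: prepend it so rules are still
        // matched against our own crawler name
        return agentName + "," + robotsAgents;
    }

    public static void main(String[] args) {
        System.out.println(checkRobotsAgents("Nutch", "mybot,Nutch"));
    }
}
```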
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13597515#comment-13597515 ]

Lewis John McGibbney commented on NUTCH-1031:

Hi Tejas. Sorry for taking forever to get around to this.
* I really like the documentation within the patch. Big +1 for this.
* Tests all pass flawlessly.
* I like the retention of the main() method in o.a.n.p.RobotRulesParser.

I've tested this on several websites, including many directories within sites like bbc.co.uk (check out the robots.txt). I am +1 for this, Tejas. Good work on this one; it's been a long time in coming to Nutch. I am keen to hear from others.
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13594391#comment-13594391 ]

Lewis John McGibbney commented on NUTCH-1031:

Hi Tejas. If you go to Maven search you will see the 0.2 release of crawler-commons. You will be able to pull this with ivy no bother. @Tejas, I agree with your views on keeping CC in the core ivy.xml, as it is likely that we will use it for the sitemaps at some stage as well. Great work Tejas.
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13585730#comment-13585730 ]

lufeng commented on NUTCH-1031:

Hi Tejas
1. The EmptyRobotRules class is not deleted in the NUTCH-1031-trunk.v2.patch file.
2. Should we add the CC dependency in the ivy.xml configuration?
3. Can we create RobotRulesParser as a Nutch plugin and extract the Protocol#getRobotRules method? That way we can move the CC dependency from nutch-core to nutch-plugin.
Thanks
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13585467#comment-13585467 ]

Tejas Patil commented on NUTCH-1031:

Hi Sebastian, thanks for your time and for suggesting the changes. Regarding the junits: I would remove those from Nutch, as CC already has its own tests and there is no point in testing it again in Nutch.
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584824#comment-13584824 ]

Sebastian Nagel commented on NUTCH-1031:

Hi Tejas, a test of NUTCH-1031-trunk.v2.patch in combination with crawler-commons-0.2 shows:
- protocol.RobotRulesParser.main does not work properly:
** robotName is not filled properly from the agent-name+ arguments
** parsed rules are printed in the Object string representation (e.g., SimpleRobotRules@2c2f1921)
- testRobotsTwoAgents failed. However, the tests are quite complex: shouldn't we trust the exhaustive tests by crawler-commons? A simple test may be sufficient to check the basic functionality and, e.g., agent names separated by commas.
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583013#comment-13583013 ]

Tejas Patil commented on NUTCH-1031:

Hey Ken, a gentle reminder about releasing CC.
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583340#comment-13583340 ]

Lewis John McGibbney commented on NUTCH-1031:

Hi Tejas. We released it ;) Really sorry for not updating here.

--
*Lewis*
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583662#comment-13583662 ]

Tejas Patil commented on NUTCH-1031:

Hi Lewis, I should have checked the main page of CC before asking over jira. Anyways, thanks for the news :) Regarding delegating the functionality: I had already made that change for both 1.x and 2.x last month, and was waiting for the release of CC. If possible, can you review the patches?
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583664#comment-13583664 ]

Tejas Patil commented on NUTCH-1031:

@Dev: I am planning to commit this change in the coming days. If anyone has suggestions, please feel free to share your thoughts.
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560877#comment-13560877 ]

Ken Krugler commented on NUTCH-1031:

I've rolled this into trunk at crawler-commons. Next step is to roll a release. Not sure when I'll get to that, but it's on my list for this week.
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560420#comment-13560420 ]

Ken Krugler commented on NUTCH-1031:

Hi Tejas, I've been on the road, but I'll check out your patch when I return to my office tomorrow. Thanks for updating it with a test case!

-- Ken
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558340#comment-13558340 ]

Ken Krugler commented on NUTCH-1031:

Hi Tejas - I've looked at your patch, and (assuming there's no requirement to support precedence in the user agent name list) it seems like a valid change. Based on the RFC (http://www.robotstxt.org/norobots-rfc.txt), robot names shouldn't contain commas, so splitting on them seems safe. Do you have a unit test to verify proper behavior? If so, I'd be happy to roll that into CC.

-- Ken
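The comma-splitting being discussed can be sketched as follows. This is illustrative only: the real change lives in crawler-commons' SimpleRobotRulesParser, and the class and method names here are hypothetical. The idea is to split a configured string like "Nutch, MyBot" into normalized names and match a robots.txt User-agent line against any of them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch of multiple-agent matching via comma-splitting;
// not the actual crawler-commons implementation.
public class RobotNameMatcher {

    /** Split "Nutch, MyBot" into normalized names: ["nutch", "mybot"]. */
    static List<String> splitNames(String robotNames) {
        List<String> names = new ArrayList<>();
        for (String name : robotNames.split(",")) {
            String trimmed = name.trim().toLowerCase(Locale.ROOT);
            if (!trimmed.isEmpty()) {
                names.add(trimmed);
            }
        }
        return names;
    }

    /** True if a User-agent value from robots.txt matches any of our names. */
    static boolean matches(String userAgentValue, String robotNames) {
        String ua = userAgentValue.trim().toLowerCase(Locale.ROOT);
        for (String name : splitNames(robotNames)) {
            if (ua.contains(name)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matches("Nutch-test", "Nutch, MyBot"));
    }
}
```

Note that this flat "match any name" model is exactly what drops precedence among the agent names, which is the caveat Ken raises above.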
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558349#comment-13558349 ]

Tejas Patil commented on NUTCH-1031:

Hi Ken, thanks for reviewing the patch. I will include a test case in the patch. Before that, a bigger question is whether Nutch should adopt the parsing model in CC and forget about precedence. BTW: did you find any error in my understanding of how CC parses robots?
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558050#comment-13558050 ] Lewis John McGibbney commented on NUTCH-1031: - Is the issue with multiple agents the only downside to using CC just now? I think your proposal is great, Tejas; however, if we are looking into supporting CC for more than just robots.txt parsing, then maybe we ought to look into donating this aspect of the Nutch code? Wdyt?
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558195#comment-13558195 ] Julien Nioche commented on NUTCH-1031: -- bq. 1. Continue to have the legacy code for parsing the robots file. bq. 2. As an add-in, crawler-commons can be employed for the parsing. The user can pick based on a config parameter, with a note indicating that #2 won't work with multiple HTTP agents. #2 is overkill IMHO. The existing code works fine, and the point of moving to CC was to get rid of some of our code, not make it bigger with yet another configuration option. Lewis: donating our code is a good idea, but in the case of the robots parsing it's more about modifying the existing one in CC. I haven't had time to look at robots parsing in CC and am not familiar with it, but it would be a good thing to improve it. In the meantime let's go for option 1. Thanks!
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557930#comment-13557930 ] Tejas Patil commented on NUTCH-1031: After waiting for more than a week, I think there is a low chance of getting a fix / change from crawler-commons. I propose the following: 1. Continue to have the legacy code for parsing the robots file. 2. As an add-in, crawler-commons can be employed for the parsing. The user can pick based on a config parameter, with a note indicating that #2 won't work with multiple HTTP agents. Would this be fine?
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545958#comment-13545958 ] Julien Nioche commented on NUTCH-1031: -- Well, we have 2 separate params: http.agent.name, which is a single value sent to the servers when fetching, and http.robots.agents, which can have multiple values and is used for parsing robots. The value of this parameter SHOULD be split based on commas. I don't think CC supports multiple values for http.robots.agents, but I'll ask Ken to be sure.
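The comma splitting of http.robots.agents described above could look roughly like this. It is only a sketch: `splitRobotsAgents` is a hypothetical helper, not an actual Nutch or crawler-commons method, and the lower-casing reflects the convention that robots.txt agent matching is case-insensitive.

```java
import java.util.ArrayList;
import java.util.List;

public class RobotsAgentUtil {

    /**
     * Splits the value of the http.robots.agents property on commas,
     * trimming whitespace and lower-casing each name. Per the robots.txt
     * RFC draft, agent names cannot themselves contain commas, so the
     * split is unambiguous.
     */
    public static List<String> splitRobotsAgents(String robotsAgents) {
        List<String> names = new ArrayList<>();
        if (robotsAgents == null) {
            return names;
        }
        for (String part : robotsAgents.split(",")) {
            String name = part.trim().toLowerCase();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }
}
```

For example, a property value of "Nutch-Spider, nutch, MyCrawler" would yield the three names nutch-spider, nutch, and mycrawler.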
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545989#comment-13545989 ] Markus Jelsma commented on NUTCH-1031: -- I think it would be a _very_ good thing to maintain support for multiple user agents, as it gives crawler operators the flexibility to be lenient about how webmasters spell the crawler name in their robots.txt.
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546639#comment-13546639 ] Tejas Patil commented on NUTCH-1031: The current Nutch robots parsing logic uses the latter approach for parsing. Having a new API for passing a list of robot names would be a clean solution.
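A list-based API could, for instance, report which configured name matched a "User-agent:" token, so that precedence (earlier entries in the list winning) can be honored. This is purely an illustrative sketch under assumed semantics: `AgentMatcher` and `firstMatch` are hypothetical names, and the case-insensitive substring matching here is one common interpretation, not necessarily what CC or Nutch does.

```java
import java.util.List;

public class AgentMatcher {

    /**
     * Returns the index of the first configured agent name that matches
     * a "User-agent:" token from robots.txt, or -1 if none match. The
     * index lets callers honor precedence: earlier names in the list win.
     * Matching is a case-insensitive substring test.
     */
    public static int firstMatch(List<String> configuredAgents, String userAgentToken) {
        String token = userAgentToken.trim().toLowerCase();
        for (int i = 0; i < configuredAgents.size(); i++) {
            if (token.contains(configuredAgents.get(i).toLowerCase())) {
                return i;
            }
        }
        return -1;
    }
}
```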
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398338#comment-13398338 ] Lewis John McGibbney commented on NUTCH-1031: - crawler-commons is available within Maven Central. Are we still interested in delegating our parsing code to crawler-commons? What is the community like over at crawler-commons, e.g. if we find bugs in the code, how/when will they get fixed?
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398340#comment-13398340 ] Julien Nioche commented on NUTCH-1031: -- crawler-commons is not super active, and I have been pretty much the only person actively involved. There have been bugfixes since the release, but not necessarily committed, IIRC. The robots parsing is working OK in Nutch, and we have loads of other things to work on which are probably more important :-)
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185285#comment-13185285 ] Lewis John McGibbney commented on NUTCH-1031: - Hi Julien, out of sheer curiosity, how do we currently parse robots.txt? I found some files (which don't do parsing) in o.a.n.protocol, but I've never known what we use for robots.txt.