[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-05-09 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13652805#comment-13652805
 ] 

Tejas Patil commented on NUTCH-1031:


I had forgotten to add the crawler-commons dependency to pom.xml. 
Just committed that to trunk (rev 1480551) and 2.x (rev 1480551).

 Delegate parsing of robots.txt to crawler-commons
 -

 Key: NUTCH-1031
 URL: https://issues.apache.org/jira/browse/NUTCH-1031
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Assignee: Tejas Patil
Priority: Minor
  Labels: robots.txt
 Fix For: 1.7, 2.2

 Attachments: CC.robots.multiple.agents.patch, 
 CC.robots.multiple.agents.v2.patch, NUTCH-1031-2.x.v1.patch, 
 NUTCH-1031-trunk.v2.patch, NUTCH-1031-trunk.v3.patch, 
 NUTCH-1031-trunk.v4.patch, NUTCH-1031-trunk.v5.patch, NUTCH-1031.v1.patch


 We're about to release the first version of Crawler-Commons 
 [http://code.google.com/p/crawler-commons/] which contains a parser for 
 robots.txt files. This parser should also be better than the one we currently 
 have in Nutch. I will delegate this functionality to CC as soon as it is 
 available publicly

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-05-09 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13653012#comment-13653012
 ] 

Lewis John McGibbney commented on NUTCH-1031:
-

Hi Tejas,
A quick note on keeping pom.xml up to date: 
whenever we do a release, pom.xml is regenerated so that it matches 
the contents and configuration of ivy.xml.
This means that every tag branch of Nutch has a completely accurate 
pom.xml, while the current development branches do not.
I will make sure to update the pom.xml in the forthcoming releases.
Regardless, thank you for the attention to detail here.



[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-04-29 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644754#comment-13644754
 ] 

Lewis John McGibbney commented on NUTCH-1031:
-

+1 from me Tejas. Unit tests all pass fine and some tests I did locally were 
good as well. CLI looks good. Documentation in the patch is really nice.



[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-04-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644868#comment-13644868
 ] 

Hudson commented on NUTCH-1031:
---

Integrated in Nutch-nutchgora #587 (See 
[https://builds.apache.org/job/Nutch-nutchgora/587/])
NUTCH-1031 Delegate parsing of robots.txt to crawler-commons (Revision 
1477319)

 Result = FAILURE
tejasp : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1477319
Files : 
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/ivy/ivy.xml
* /nutch/branches/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java
* /nutch/branches/2.x/src/java/org/apache/nutch/protocol/EmptyRobotRules.java
* /nutch/branches/2.x/src/java/org/apache/nutch/protocol/Protocol.java
* /nutch/branches/2.x/src/java/org/apache/nutch/protocol/RobotRulesParser.java
* /nutch/branches/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
* /nutch/branches/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
* /nutch/branches/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java
* /nutch/branches/2.x/src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java
* /nutch/branches/2.x/src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java
* /nutch/branches/2.x/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java
* /nutch/branches/2.x/src/plugin/protocol-sftp/src/java/org/apache/nutch/protocol/sftp/Sftp.java




[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-04-05 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13624194#comment-13624194
 ] 

Tejas Patil commented on NUTCH-1031:


I have removed the @author tag and ported the checks from 2.x to the patch, as 
per the suggestion from [~wastl-nagel]. I will commit the changes to trunk 
shortly and start work on porting these changes to 2.x.



[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-04-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13624291#comment-13624291
 ] 

Hudson commented on NUTCH-1031:
---

Integrated in Nutch-trunk #2156 (See 
[https://builds.apache.org/job/Nutch-trunk/2156/])
NUTCH-1031 Delegate parsing of robots.txt to crawler-commons (Revision 
1465159)

 Result = SUCCESS
tejasp : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1465159
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
* /nutch/trunk/src/java/org/apache/nutch/protocol/EmptyRobotRules.java
* /nutch/trunk/src/java/org/apache/nutch/protocol/Protocol.java
* /nutch/trunk/src/java/org/apache/nutch/protocol/RobotRulesParser.java
* /nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
* /nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
* /nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java
* /nutch/trunk/src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java
* /nutch/trunk/src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java
* /nutch/trunk/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java




[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-03-15 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603406#comment-13603406
 ] 

Sebastian Nagel commented on NUTCH-1031:


+1 (nothing to complain about)

P.S.: see [~jnioche]'s comment in NUTCH-1541 about the {{@author}} tag (a 
formality you couldn't know about)



[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-03-15 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603482#comment-13603482
 ] 

Sebastian Nagel commented on NUTCH-1031:


There are differences between trunk and 2.x:
* in org.apache.nutch.protocol.http.api.RobotRulesParser (lib-http), 2.x does 
additional plausibility checks for the properties {{http.agent.name}} and 
{{http.robots.agents}}

Maybe that's worth taking into trunk as well, also with respect to porting 
this issue to 2.x :)
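
The comment above does not spell out what the 2.x plausibility checks are. As a rough sketch only, assuming the checks amount to "the advertised agent name must be set" and "it should be listed first in http.robots.agents", something along these lines would do. The class and method names are hypothetical and are not the Nutch code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not the actual Nutch 2.x implementation.
final class AgentConfigCheck {

    /**
     * Returns a list of human-readable problems found in the two property
     * values. An empty list means the configuration looks plausible.
     */
    static List<String> check(String agentName, String robotsAgents) {
        List<String> problems = new ArrayList<>();

        // Assumed check 1: http.agent.name must be set to a non-blank value.
        if (agentName == null || agentName.trim().isEmpty()) {
            problems.add("http.agent.name is not set");
            return problems;
        }
        String name = agentName.trim();

        // Assumed check 2: the advertised name should be the first entry of the
        // comma-separated http.robots.agents list. The limit of -1 keeps
        // trailing empty strings, so element 0 always exists.
        String first = (robotsAgents == null)
                ? ""
                : robotsAgents.split(",", -1)[0].trim();
        if (first.isEmpty()) {
            problems.add("http.robots.agents has no first entry");
        } else if (!first.equalsIgnoreCase(name)) {
            problems.add("http.agent.name '" + name
                    + "' is not listed first in http.robots.agents");
        }
        return problems;
    }

    public static void main(String[] args) {
        // Example values only; a real crawler would read them from its configuration.
        System.out.println(check("mybot", "mybot,*"));
        System.out.println(check("mybot", "otherbot,mybot,*"));
        System.out.println(check("", "mybot,*"));
    }
}
```

Returning a list of problems rather than throwing lets the caller decide whether a misconfiguration should be logged or should stop the crawl.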




[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-03-08 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13597515#comment-13597515
 ] 

Lewis John McGibbney commented on NUTCH-1031:
-

Hi Tejas. Sorry for taking forever to get around to this. 
* I really like the documentation within the patch. Big +1 for this.
* Tests all pass flawlessly.
* I like the retention of the main() method in o.a.n.p.RobotRulesParser.
I've tested this on several websites, including many directories within sites 
like bbc.co.uk (check out the robots.txt).
I am +1 for this Tejas. Good work on this one, it's been a long time in coming 
to Nutch.
I am keen to hear from others.



[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-03-05 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13594391#comment-13594391
 ] 

Lewis John McGibbney commented on NUTCH-1031:
-

Hi Tejas. If you search Maven you will see the 0.2 release of 
crawler-commons. You will be able to pull this with Ivy no bother. @Tejas, I 
agree with your views on keeping CC in the core ivy.xml, as it is likely that 
we will use it for the sitemaps at some stage as well. Great work Tejas.



[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-02-25 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13585730#comment-13585730
 ] 

lufeng commented on NUTCH-1031:
---

Hi Tejas

1. The EmptyRobotRules class is not deleted in the NUTCH-1031-trunk.v2.patch 
file.
2. Should we add the CC dependency to the ivy.xml configuration?
3. Can we create a RobotRulesParser as a Nutch plugin and extract the 
Protocol#getRobotRules method? Then we could move the CC dependency from 
nutch-core to nutch-plugin.

Thanks






[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-02-24 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13585467#comment-13585467
 ] 

Tejas Patil commented on NUTCH-1031:


Hi Sebastian,
Thanks for your time and for suggesting the changes. 
Regarding the JUnit tests: I would remove those from Nutch, as CC already has 
its own tests and there is no point in testing it again in Nutch. 



[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-02-22 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584824#comment-13584824
 ] 

Sebastian Nagel commented on NUTCH-1031:


Hi Tejas, a test of NUTCH-1031-trunk.v2.patch in combination with 
crawler-commons-0.2 shows:
- protocol.RobotRulesParser.main does not work properly:
  ** robotName is not filled properly by the agent-name+ arguments
  ** parsed rules are printed in the Object string representation (e.g., 
SimpleRobotRules@2c2f1921)
- testRobotsTwoAgents failed. However, the tests are quite complex: shouldn't 
we rely on the exhaustive tests in crawler-commons? A simple test may be 
sufficient to cover the basic functionality and, e.g., agent names separated 
by commas.
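
The two problems reported for main() can be illustrated in isolation. The following is only a sketch with hypothetical stand-in classes, not the Nutch or crawler-commons code, and it assumes a command line of the form robots-file, url-file, then one or more agent names. It shows how trailing arguments could be joined into one comma-separated robot name string, and why an object without a toString() override prints as ClassName@hash:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration; not the Nutch or crawler-commons classes.
final class RulesPrintingSketch {

    /** Joins the arguments from firstAgentIndex onwards into a comma-separated string. */
    static String joinAgentNames(String[] args, int firstAgentIndex) {
        return String.join(",", Arrays.copyOfRange(args, firstAgentIndex, args.length));
    }

    /** A rule container without toString(): printing it falls back to Object.toString(). */
    static final class OpaqueRules {
        final List<String> disallowed = new ArrayList<>();
    }

    /** The same container with toString() overridden so the rules are readable. */
    static final class PrintableRules {
        final List<String> disallowed = new ArrayList<>();

        @Override
        public String toString() {
            return "Disallow: " + String.join(", ", disallowed);
        }
    }

    public static void main(String[] args) {
        // Assumed argument layout: <robots-file> <url-file> <agent-name>+
        String[] cli = {"robots.txt", "urls.txt", "agent-one", "agent-two"};
        System.out.println(joinAgentNames(cli, 2));

        OpaqueRules opaque = new OpaqueRules();
        opaque.disallowed.add("/private");
        System.out.println(opaque);    // class name, '@', then a hash code in hex

        PrintableRules printable = new PrintableRules();
        printable.disallowed.add("/private");
        System.out.println(printable); // the rules themselves
    }
}
```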



[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-02-21 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583013#comment-13583013
 ] 

Tejas Patil commented on NUTCH-1031:


Hey Ken, A gentle reminder for releasing CC.



[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-02-21 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583340#comment-13583340
 ] 

Lewis John McGibbney commented on NUTCH-1031:
-

Hi Tejas. We released it ;)

Really sorry for not updating


-- 
*Lewis*




[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-02-21 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583662#comment-13583662
 ] 

Tejas Patil commented on NUTCH-1031:


Hi Lewis,

I should have checked the main page of CC before asking over JIRA. Anyway, 
thanks for the news :) 

Regarding delegating the functionality: I had already made that change for 
both 1.x and 2.x last month and was waiting for the release of CC. If 
possible, can you review the patches?



[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-02-21 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583664#comment-13583664
 ] 

Tejas Patil commented on NUTCH-1031:


@Dev: I am planning to commit this change in the coming days. If anyone has 
suggestions, please feel free to share your thoughts.



[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-23 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560877#comment-13560877
 ] 

Ken Krugler commented on NUTCH-1031:


I've rolled this into trunk at crawler-commons. Next step is to roll a release. 
Not sure when I'll get to that, but on my list for this week.



[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-22 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560420#comment-13560420
 ] 

Ken Krugler commented on NUTCH-1031:


Hi Tejas,

I've been on the road, but I'll check out your patch when I return to my 
office tomorrow. Thanks for updating it with a test case!

-- Ken



[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-20 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558340#comment-13558340
 ] 

Ken Krugler commented on NUTCH-1031:


Hi Tejas - I've looked at your patch, and (assuming there's not a requirement 
to support precedence in the user agent name list) it seems like a valid 
change. Based on the RFC (http://www.robotstxt.org/norobots-rfc.txt) robot 
names shouldn't have commas, so splitting on that seems safe. Do you have a 
unit test to verify proper behavior? If so, I'd be happy to roll that into CC.

-- Ken
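
A minimal sketch of the comma split discussed above, assuming names are compared case-insensitively and that stray whitespace or empty entries should be ignored. The class and method names are hypothetical and belong to neither Nutch nor crawler-commons.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper; not part of Nutch or crawler-commons. */
final class AgentNameSplitter {

    /** Splits a comma-separated list of robot names, trimming and lower-casing each entry. */
    static List<String> split(String agentNames) {
        List<String> names = new ArrayList<>();
        if (agentNames == null) {
            return names;
        }
        for (String part : agentNames.split(",")) {
            String name = part.trim().toLowerCase(Locale.ROOT);
            if (!name.isEmpty()) { // skip empty entries such as "a,,b"
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // prints [mybot, mybot-crawler, nutch]
        System.out.println(split("MyBot, mybot-crawler ,,Nutch"));
    }
}
```

Because robot names should not contain commas, no escaping or quoting is handled here.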

 Delegate parsing of robots.txt to crawler-commons
 -

 Key: NUTCH-1031
 URL: https://issues.apache.org/jira/browse/NUTCH-1031
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Assignee: Tejas Patil
Priority: Minor
  Labels: robots.txt
 Fix For: 1.7

 Attachments: CC.robots.multiple.agents.patch, NUTCH-1031.v1.patch





[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-20 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13558349#comment-13558349
 ] 

Tejas Patil commented on NUTCH-1031:


Hi Ken, 
Thanks for reviewing the patch. I will include a test case in the patch. Before 
that, a bigger question is whether Nutch should adopt the parsing model in CC 
and forget about the precedence.
BTW: Did you find any error in my understanding of how CC parses robots.txt?
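
The thread does not spell out what precedence means here. One possible reading, offered as an assumption rather than a description of either codebase: with precedence, the order of our own agent names decides which robots.txt group applies; without it, the order of groups in the file decides. All names below are hypothetical.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical illustration; not the actual logic of Nutch or crawler-commons. */
final class AgentPrecedenceSketch {

    /** Precedence model: walk our own names in order; the first one robots.txt mentions wins. */
    static Optional<String> pickWithPrecedence(List<String> ourNames, List<String> agentsInFile) {
        for (String ours : ourNames) {
            if (agentsInFile.contains(ours)) {
                return Optional.of(ours);
            }
        }
        return Optional.empty();
    }

    /** No-precedence model: walk robots.txt in file order; the first agent naming any of us wins. */
    static Optional<String> pickFirstInFile(List<String> ourNames, List<String> agentsInFile) {
        for (String inFile : agentsInFile) {
            if (ourNames.contains(inFile)) {
                return Optional.of(inFile);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> ours = List.of("mybot", "nutch");               // our names, most preferred first
        List<String> file = List.of("googlebot", "nutch", "mybot");  // User-agent values in file order
        System.out.println(pickWithPrecedence(ours, file)); // prints Optional[mybot]
        System.out.println(pickFirstInFile(ours, file));    // prints Optional[nutch]
    }
}
```

The two models only disagree when robots.txt has separate groups for more than one of our names, which is the case the question above is about.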

 Delegate parsing of robots.txt to crawler-commons
 -

 Key: NUTCH-1031
 URL: https://issues.apache.org/jira/browse/NUTCH-1031
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Assignee: Tejas Patil
Priority: Minor
  Labels: robots.txt
 Fix For: 1.7

 Attachments: CC.robots.multiple.agents.patch, NUTCH-1031.v1.patch





[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-19 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13558050#comment-13558050
 ] 

Lewis John McGibbney commented on NUTCH-1031:
-

Is the issue with multiple agents the only downside to using CC just now?
I think your proposal is great, Tejas. However, if we are looking into 
supporting CC for more than just robots.txt parsing, then maybe we ought to 
look into donating this aspect of the Nutch code?
Wdyt?

 Delegate parsing of robots.txt to crawler-commons
 -

 Key: NUTCH-1031
 URL: https://issues.apache.org/jira/browse/NUTCH-1031
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Assignee: Julien Nioche
Priority: Minor
  Labels: robots.txt
 Fix For: 1.7

 Attachments: NUTCH-1031.v1.patch





[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-19 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13558195#comment-13558195
 ] 

Julien Nioche commented on NUTCH-1031:
--

bq. 1. Continue to have the legacy code for parsing robots file. 
bq. 2. As an add-in, crawler-commons can be employed for the parsing. User can 
pick based on a config parameter with a note indicating that #2 won't work with 
multiple HTTP agents.

Option 2 is overkill IMHO. The existing code works fine, and the point of moving 
to CC was to get rid of some of our code, not to make it bigger with yet another 
configuration. 

Lewis: donating our code is a good idea, but in the case of the robots parsing 
it's more about modifying the existing one in CC. I haven't had time to look at 
robots parsing in CC and am not familiar with it, but it would be a good thing 
to improve it. In the meantime let's go for option 1. Thanks!


 Delegate parsing of robots.txt to crawler-commons
 -

 Key: NUTCH-1031
 URL: https://issues.apache.org/jira/browse/NUTCH-1031
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Assignee: Julien Nioche
Priority: Minor
  Labels: robots.txt
 Fix For: 1.7

 Attachments: NUTCH-1031.v1.patch





[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-18 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13557930#comment-13557930
 ] 

Tejas Patil commented on NUTCH-1031:


After waiting for more than a week, I think there is a low chance of getting 
a fix / change from crawler-commons. 
I propose the following:
1. Continue to have the legacy code for parsing the robots file.
2. As an add-in, crawler-commons can be employed for the parsing. 

The user can pick based on a config parameter, with a note indicating that #2 
won't work with multiple HTTP agents.
Would this be fine?

 Delegate parsing of robots.txt to crawler-commons
 -

 Key: NUTCH-1031
 URL: https://issues.apache.org/jira/browse/NUTCH-1031
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Assignee: Julien Nioche
Priority: Minor
  Labels: robots.txt
 Fix For: 1.7

 Attachments: NUTCH-1031.v1.patch





[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-07 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13545958#comment-13545958
 ] 

Julien Nioche commented on NUTCH-1031:
--

Well, we have 2 separate params: http.agent.name, which is a single value sent 
to the servers when fetching, and http.robots.agents, which can have multiple 
values and is used for parsing robots.txt. The value of http.robots.agents 
SHOULD be split on commas.

I don't think CC supports multiple values for http.robots.agents, but I'll ask 
Ken to be sure.
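
To make the two roles concrete, here is a rough sketch in which java.util.Properties stands in for Nutch's configuration object. That stand-in, the class name, and the example values are assumptions for illustration only; just the two property names come from the comment above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

/** Hypothetical sketch; Properties stands in for the real configuration object. */
final class RobotsAgentConfigSketch {

    /** The single name sent to servers when fetching. */
    static String fetchAgent(Properties conf) {
        return conf.getProperty("http.agent.name", "").trim();
    }

    /** The names used when parsing robots.txt, split on commas. */
    static List<String> robotsAgents(Properties conf) {
        String raw = conf.getProperty("http.robots.agents", "").trim();
        if (raw.isEmpty()) {
            return List.of();
        }
        return Arrays.asList(raw.split("\\s*,\\s*"));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("http.agent.name", "mybot");                  // example value
        conf.setProperty("http.robots.agents", "mybot, my-bot ,nutch"); // example value
        System.out.println(fetchAgent(conf));   // prints mybot
        System.out.println(robotsAgents(conf)); // prints [mybot, my-bot, nutch]
    }
}
```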

 Delegate parsing of robots.txt to crawler-commons
 -

 Key: NUTCH-1031
 URL: https://issues.apache.org/jira/browse/NUTCH-1031
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Assignee: Julien Nioche
Priority: Minor
  Labels: robots.txt
 Fix For: 1.7

 Attachments: NUTCH-1031.v1.patch





[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-07 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13545989#comment-13545989
 ] 

Markus Jelsma commented on NUTCH-1031:
--

I think it would be a _very_ good thing to maintain support for multiple user 
agents, as it gives crawler operators the flexibility to be lenient about how 
webmasters spell the crawler name in their robots.txt.

 Delegate parsing of robots.txt to crawler-commons
 -

 Key: NUTCH-1031
 URL: https://issues.apache.org/jira/browse/NUTCH-1031
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Assignee: Julien Nioche
Priority: Minor
  Labels: robots.txt
 Fix For: 1.7

 Attachments: NUTCH-1031.v1.patch





[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-07 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13546639#comment-13546639
 ] 

Tejas Patil commented on NUTCH-1031:


The current Nutch robots parsing logic uses the latter approach for parsing. 
Having a new API for passing a list of robot names would be a clean solution.
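
As a rough idea of what such an API could look like, the sketch below takes the robots.txt bytes plus a collection of robot names and returns the Disallow values of the first group that names any of them. The interface, class, and method names are hypothetical and are not the crawler-commons API. Wildcard groups, Allow lines, and inline comments are deliberately left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch; not the API of Nutch or crawler-commons. */
final class MultiAgentRobotsSketch {

    /** Hypothetical API shape: the caller passes every robot name it answers to. */
    interface Parser {
        List<String> disallowedPrefixes(byte[] robotsTxt, Collection<String> robotNames);
    }

    /** Returns the Disallow values of the first group that names any of the given robots. */
    static final class FirstMatchParser implements Parser {
        @Override
        public List<String> disallowedPrefixes(byte[] robotsTxt, Collection<String> robotNames) {
            List<String> wanted = new ArrayList<>();
            for (String name : robotNames) {
                wanted.add(name.trim().toLowerCase(Locale.ROOT));
            }
            List<String> prefixes = new ArrayList<>();
            boolean readingAgents = false; // previous directive was a User-agent line
            boolean inMatch = false;       // current group names one of our robots
            boolean matched = false;       // some group has already matched
            String text = new String(robotsTxt, StandardCharsets.UTF_8);
            for (String raw : text.split("\\r?\\n")) {
                String line = raw.trim();
                int colon = line.indexOf(':');
                if (line.startsWith("#") || colon < 0) {
                    continue; // comment, blank line, or junk
                }
                String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
                String value = line.substring(colon + 1).trim();
                if (field.equals("user-agent")) {
                    if (!readingAgents) { // a new group starts here
                        if (matched) {
                            break;        // our group is complete
                        }
                        inMatch = false;
                    }
                    readingAgents = true;
                    if (wanted.contains(value.toLowerCase(Locale.ROOT))) {
                        inMatch = true;
                        matched = true;
                    }
                } else {
                    readingAgents = false;
                    if (inMatch && field.equals("disallow") && !value.isEmpty()) {
                        prefixes.add(value);
                    }
                }
            }
            return prefixes;
        }
    }

    public static void main(String[] args) {
        String robots = String.join("\n",
            "User-agent: googlebot",
            "Disallow: /g",
            "",
            "User-agent: nutch",
            "User-agent: otherbot",
            "Disallow: /private",
            "Disallow: /tmp",
            "",
            "User-agent: *",
            "Disallow: /");
        Parser parser = new FirstMatchParser();
        // prints [/private, /tmp]
        System.out.println(parser.disallowedPrefixes(
            robots.getBytes(StandardCharsets.UTF_8), List.of("MyBot", "Nutch")));
    }
}
```

Passing the names as a collection keeps the comma-splitting concern on the caller's side, so the parser itself never needs to know how the names were configured.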

 Delegate parsing of robots.txt to crawler-commons
 -

 Key: NUTCH-1031
 URL: https://issues.apache.org/jira/browse/NUTCH-1031
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Assignee: Julien Nioche
Priority: Minor
  Labels: robots.txt
 Fix For: 1.7

 Attachments: NUTCH-1031.v1.patch





[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2012-06-21 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13398338#comment-13398338
 ] 

Lewis John McGibbney commented on NUTCH-1031:
-

crawler-commons is available in Maven Central. Are we still interested in 
delegating our parsing code to crawler-commons? What is the community like over 
at crawler-commons, e.g. if we find bugs in the code, how and when would they 
get fixed?

 Delegate parsing of robots.txt to crawler-commons
 -

 Key: NUTCH-1031
 URL: https://issues.apache.org/jira/browse/NUTCH-1031
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Assignee: Julien Nioche
Priority: Minor
  Labels: robots.txt
 Fix For: 1.6







[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2012-06-21 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13398340#comment-13398340
 ] 

Julien Nioche commented on NUTCH-1031:
--

crawler-commons is not super active and I have been pretty much the only person 
actively involved. There have been bugfixes since the release, but they have 
not necessarily been committed, IIRC.
The robots parsing is working OK in Nutch and we have loads of other things to 
work on which are probably more important :-)


 Delegate parsing of robots.txt to crawler-commons
 -

 Key: NUTCH-1031
 URL: https://issues.apache.org/jira/browse/NUTCH-1031
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Assignee: Julien Nioche
Priority: Minor
  Labels: robots.txt
 Fix For: 1.6







[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2012-01-12 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13185285#comment-13185285
 ] 

Lewis John McGibbney commented on NUTCH-1031:
-

Hi Julien, out of sheer curiosity, how do we currently parse robots.txt? I 
found some files (which don't do parsing) in o.a.n.protocol, but I've never 
known what we use for robots.txt.

 Delegate parsing of robots.txt to crawler-commons
 -

 Key: NUTCH-1031
 URL: https://issues.apache.org/jira/browse/NUTCH-1031
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Assignee: Julien Nioche
Priority: Minor
  Labels: robots.txt
 Fix For: 1.5


