[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-04-28 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1031:
---

Attachment: NUTCH-1031-2.x.v1.patch

Patch for 2.x. If there are no objections, I will commit it in the coming days.

 Delegate parsing of robots.txt to crawler-commons
 -

 Key: NUTCH-1031
 URL: https://issues.apache.org/jira/browse/NUTCH-1031
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Assignee: Tejas Patil
Priority: Minor
  Labels: robots.txt
 Fix For: 1.7

 Attachments: CC.robots.multiple.agents.patch, 
 CC.robots.multiple.agents.v2.patch, NUTCH-1031-2.x.v1.patch, 
 NUTCH-1031-trunk.v2.patch, NUTCH-1031-trunk.v3.patch, 
 NUTCH-1031-trunk.v4.patch, NUTCH-1031-trunk.v5.patch, NUTCH-1031.v1.patch


 We're about to release the first version of Crawler-Commons 
 [http://code.google.com/p/crawler-commons/] which contains a parser for 
 robots.txt files. This parser should also be better than the one we currently 
 have in Nutch. I will delegate this functionality to CC as soon as it is 
 available publicly



[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-03-08 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1031:
---

Attachment: NUTCH-1031-trunk.v5.patch

Thanks Lewis :) I have corrected the usage message.

 Delegate parsing of robots.txt to crawler-commons
 -

 Key: NUTCH-1031
 URL: https://issues.apache.org/jira/browse/NUTCH-1031
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Assignee: Tejas Patil
Priority: Minor
  Labels: robots.txt
 Fix For: 1.7

 Attachments: CC.robots.multiple.agents.patch, 
 CC.robots.multiple.agents.v2.patch, NUTCH-1031-trunk.v2.patch, 
 NUTCH-1031-trunk.v3.patch, NUTCH-1031-trunk.v4.patch, 
 NUTCH-1031-trunk.v5.patch, NUTCH-1031.v1.patch


 We're about to release the first version of Crawler-Commons 
 [http://code.google.com/p/crawler-commons/] which contains a parser for 
 robots.txt files. This parser should also be better than the one we currently 
 have in Nutch. I will delegate this functionality to CC as soon as it is 
 available publicly



[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-03-05 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1031:
---

Attachment: NUTCH-1031-trunk.v4.patch

Hey Lewis, thanks for pointing that out :) I have updated the patch.

 Delegate parsing of robots.txt to crawler-commons
 -

 Key: NUTCH-1031
 URL: https://issues.apache.org/jira/browse/NUTCH-1031
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Assignee: Tejas Patil
Priority: Minor
  Labels: robots.txt
 Fix For: 1.7

 Attachments: CC.robots.multiple.agents.patch, 
 CC.robots.multiple.agents.v2.patch, NUTCH-1031-trunk.v2.patch, 
 NUTCH-1031-trunk.v3.patch, NUTCH-1031-trunk.v4.patch, NUTCH-1031.v1.patch


 We're about to release the first version of Crawler-Commons 
 [http://code.google.com/p/crawler-commons/] which contains a parser for 
 robots.txt files. This parser should also be better than the one we currently 
 have in Nutch. I will delegate this functionality to CC as soon as it is 
 available publicly



[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-21 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1031:
---

Attachment: CC.robots.multiple.agents.v2.patch

Hi Ken, I have added a test case to CC for the change 
(CC.robots.multiple.agents.v2.patch).
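
For reference, a test along these lines might look roughly like the sketch below. This is a hypothetical reconstruction, not the attached patch: it assumes the 0.x-era crawlercommons.robots.SimpleRobotRulesParser#parseContent(String, byte[], String, String) signature and the comma-splitting of the robot-names argument that the patch introduces.

{noformat}
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import java.nio.charset.StandardCharsets;

import org.junit.Test;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class MultipleAgentsSketchTest {

  @Test
  public void multipleConfiguredAgents() {
    String robotsTxt = "User-agent: Agent1\n"
                     + "Disallow: /a\n"
                     + "\n"
                     + "User-agent: Agent2 Agent3\n"
                     + "Disallow: /d\n";
    SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
    // "Agent2,Agent1" mirrors a comma-separated http.robots.agents value
    BaseRobotRules rules = parser.parseContent("http://example.com/robots.txt",
        robotsTxt.getBytes(StandardCharsets.UTF_8), "text/plain", "Agent2,Agent1");
    // With the first-match semantics described in the 2013-01-20 comment below,
    // Agent1's record applies: /a is disallowed, /d stays allowed.
    assertFalse(rules.isAllowed("http://example.com/a"));
    assertTrue(rules.isAllowed("http://example.com/d"));
  }
}
{noformat}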

 Delegate parsing of robots.txt to crawler-commons
 -

 Key: NUTCH-1031
 URL: https://issues.apache.org/jira/browse/NUTCH-1031
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Assignee: Tejas Patil
Priority: Minor
  Labels: robots.txt
 Fix For: 1.7

 Attachments: CC.robots.multiple.agents.patch, 
 CC.robots.multiple.agents.v2.patch, NUTCH-1031.v1.patch


 We're about to release the first version of Crawler-Commons 
 [http://code.google.com/p/crawler-commons/] which contains a parser for 
 robots.txt files. This parser should also be better than the one we currently 
 have in Nutch. I will delegate this functionality to CC as soon as it is 
 available publicly



[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-21 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1031:
---

Attachment: NUTCH-1031-trunk.v2.patch

Added a patch for Nutch trunk (NUTCH-1031-trunk.v2.patch). If nobody has 
objections, I will work on the corresponding patch for 2.x.
Summary of the changes:
- Removed the RobotRules class, as CC provides a replacement: BaseRobotRules 
(see the sketch after this list).
- Moved RobotRulesParser out of the http plugin on account of NUTCH-1513, since 
other protocols might share it.
- Added HttpRobotRulesParser, which is responsible for fetching the robots.txt 
file over the HTTP protocol.
- Changed references from the old Nutch classes to the classes from CC.
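
As a rough illustration of the consumer side, here is a minimal sketch (not code from the patch; it only assumes the crawlercommons.robots.BaseRobotRules methods isAllowed(String) and getCrawlDelay(), with the actual wiring living in the (Http)RobotRulesParser classes):

{noformat}
import java.net.URL;

import crawlercommons.robots.BaseRobotRules;

public class RobotsCheckSketch {

  /** True if robots.txt allows fetching the given URL for the configured agents. */
  static boolean mayFetch(BaseRobotRules rules, URL url) {
    // BaseRobotRules is CC's replacement for the removed Nutch RobotRules class;
    // isAllowed() and getCrawlDelay() are the calls the fetch path cares about.
    return rules.isAllowed(url.toString());
  }

  /** Crawl delay declared in robots.txt, if any (sentinel handling left to the caller). */
  static long crawlDelay(BaseRobotRules rules) {
    return rules.getCrawlDelay();
  }
}
{noformat}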

 Delegate parsing of robots.txt to crawler-commons
 -

 Key: NUTCH-1031
 URL: https://issues.apache.org/jira/browse/NUTCH-1031
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Assignee: Tejas Patil
Priority: Minor
  Labels: robots.txt
 Fix For: 1.7

 Attachments: CC.robots.multiple.agents.patch, 
 CC.robots.multiple.agents.v2.patch, NUTCH-1031-trunk.v2.patch, 
 NUTCH-1031.v1.patch


 We're about to release the first version of Crawler-Commons 
 [http://code.google.com/p/crawler-commons/] which contains a parser for 
 robots.txt files. This parser should also be better than the one we currently 
 have in Nutch. I will delegate this functionality to CC as soon as it is 
 available publicly



[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-20 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1031:
---

Attachment: CC.robots.multiple.agents.patch

I looked at the source code of CC to understand how it works, and identified 
the change needed in CC so that it supports multiple user agents. While testing 
it, I found that there is a semantic difference between the way CC works and 
the legacy Nutch parser.

*What CC does:*
It splits _http.robots.agents_ on commas (the change I made locally).
It scans the robots file line by line, each time checking whether the current 
User-Agent from the file matches any of the agents in _http.robots.agents_. 
If a match is found, it takes all the rules for that agent and stops further 
parsing.

{noformat}robots file
User-Agent: Agent1 #foo
Disallow: /a

User-Agent: Agent2 Agent3
Disallow: /d

http.robots.agents: Agent2,Agent1

Path: /a{noformat}

For the example above, as soon as the first line of the robots file is scanned, 
a match for Agent1 is found. CC then reads all the rules for that agent and 
stores only this information:
{noformat}User-Agent: Agent1
Disallow: /a{noformat}

Everything else is ignored.

*What the Nutch robots parser does:*
It splits _http.robots.agents_ on commas, scans ALL the lines of the robots 
file, and evaluates the matches according to the precedence of the user agents.
For the example above, the rules for both Agent2 and Agent1 have a match in the 
robots file, but since Agent2 comes first in _http.robots.agents_, it is given 
priority and the rules stored will be:
{noformat}User-Agent: Agent2
Disallow: /d{noformat}

If we want to drop the precedence-based behaviour and adopt the CC model, I 
have a small patch for crawler-commons (CC.robots.multiple.agents.patch).
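
To make the difference concrete, here is a small self-contained sketch (plain Java, no CC or Nutch dependency; the agent names and rules are taken from the example above) that applies both selection models to the same records:

{noformat}
import java.util.Arrays;
import java.util.List;

public class AgentSelectionSketch {

  /** One "User-Agent:" record from the example robots.txt. */
  static class Record {
    final List<String> agents;
    final String rule;
    Record(List<String> agents, String rule) { this.agents = agents; this.rule = rule; }
  }

  public static void main(String[] args) {
    // Records in file order: Agent1 -> "Disallow: /a", Agent2/Agent3 -> "Disallow: /d"
    List<Record> records = Arrays.asList(
        new Record(Arrays.asList("agent1"), "Disallow: /a"),
        new Record(Arrays.asList("agent2", "agent3"), "Disallow: /d"));

    // http.robots.agents split on ',' -- the order defines the legacy precedence
    List<String> configured = Arrays.asList("agent2", "agent1");

    // CC model (with the attached patch): the first record in the file whose
    // agent matches ANY configured name wins -> Agent1, "Disallow: /a"
    String ccPick = null;
    for (Record rec : records) {
      for (String agent : rec.agents) {
        if (configured.contains(agent)) { ccPick = rec.rule; break; }
      }
      if (ccPick != null) break;
    }

    // Legacy Nutch model: scan the whole file, then prefer the configured
    // agent with the lowest index -> Agent2, "Disallow: /d"
    String nutchPick = null;
    int bestIdx = Integer.MAX_VALUE;
    for (Record rec : records) {
      for (String agent : rec.agents) {
        int idx = configured.indexOf(agent);
        if (idx >= 0 && idx < bestIdx) { bestIdx = idx; nutchPick = rec.rule; }
      }
    }

    System.out.println("CC model picks:           " + ccPick);     // Disallow: /a
    System.out.println("Legacy Nutch model picks: " + nutchPick);  // Disallow: /d
  }
}
{noformat}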

 Delegate parsing of robots.txt to crawler-commons
 -

 Key: NUTCH-1031
 URL: https://issues.apache.org/jira/browse/NUTCH-1031
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Assignee: Julien Nioche
Priority: Minor
  Labels: robots.txt
 Fix For: 1.7

 Attachments: CC.robots.multiple.agents.patch, NUTCH-1031.v1.patch


 We're about to release the first version of Crawler-Commons 
 [http://code.google.com/p/crawler-commons/] which contains a parser for 
 robots.txt files. This parser should also be better than the one we currently 
 have in Nutch. I will delegate this functionality to CC as soon as it is 
 available publicly



[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1031:
-

Fix Version/s: 1.6 (was: 1.5)

20120304-push-1.6

 Delegate parsing of robots.txt to crawler-commons
 -

 Key: NUTCH-1031
 URL: https://issues.apache.org/jira/browse/NUTCH-1031
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Assignee: Julien Nioche
Priority: Minor
  Labels: robots.txt
 Fix For: 1.6


 We're about to release the first version of Crawler-Commons 
 [http://code.google.com/p/crawler-commons/] which contains a parser for 
 robots.txt files. This parser should also be better than the one we currently 
 have in Nutch. I will delegate this functionality to CC as soon as it is 
 available publicly

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira