[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-07 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1031: --- Attachment: NUTCH-1031.v1.patch The changes are done. Please let me know your comments. One issue: I

[jira] [Commented] (NUTCH-1513) Support Robots.txt for Ftp urls

2013-01-07 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545691#comment-13545691 ] Tejas Patil commented on NUTCH-1513: Hi Lewis, Thanks for your suggestion. I think tha

[jira] [Commented] (NUTCH-1508) Port limit crawler to defined depth to 2.x

2013-01-07 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545748#comment-13545748 ] Ferdy Galema commented on NUTCH-1508: - Hi, Is this related to? https://issues.apache.

[jira] [Comment Edited] (NUTCH-1508) Port limit crawler to defined depth to 2.x

2013-01-07 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545748#comment-13545748 ] Ferdy Galema edited comment on NUTCH-1508 at 1/7/13 10:15 AM: --

[jira] [Commented] (NUTCH-1508) Port limit crawler to defined depth to 2.x

2013-01-07 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545757#comment-13545757 ] Julien Nioche commented on NUTCH-1508: -- Hi Ferdy I did not see NUTCH-1431 at all :-(

[jira] [Commented] (NUTCH-1508) Port limit crawler to defined depth to 2.x

2013-01-07 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545766#comment-13545766 ] Ferdy Galema commented on NUTCH-1508: - NUTCH-1431 (aka 'distance' concept) only define

[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-07 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545958#comment-13545958 ] Julien Nioche commented on NUTCH-1031: -- well we have 2 separate params : http.agent.n

[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-07 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545989#comment-13545989 ] Markus Jelsma commented on NUTCH-1031: -- I think it would be a _very_ good thing to ma

[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-07 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546089#comment-13546089 ] Ken Krugler commented on NUTCH-1031: Based on my reading of the robots.txt RFC ("The r

Build failed in Jenkins: Nutch-trunk #2071

2013-01-07 Thread Apache Jenkins Server
See -- [...truncated 3442 lines...] [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlnormalizer-host copy-generated-lib: [

[jira] [Commented] (NUTCH-1494) RSS feed plugin seems broken

2013-01-07 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546409#comment-13546409 ] Lewis John McGibbney commented on NUTCH-1494: - The patch in this issue is not

[jira] [Commented] (NUTCH-1511) Metadata in MYSQL updated with 'garbage'

2013-01-07 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546414#comment-13546414 ] Lewis John McGibbney commented on NUTCH-1511: - Is this a problem somewhere wit

[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2013-01-07 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546424#comment-13546424 ] Lewis John McGibbney commented on NUTCH-1245: - So this patch is good for testi

[jira] [Commented] (NUTCH-978) A Plugin for extracting certain element of a web page on html page parsing.

2013-01-07 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546453#comment-13546453 ] Lewis John McGibbney commented on NUTCH-978: Hi Emmanuel, do you wish to addres

[jira] [Commented] (NUTCH-1494) RSS feed plugin seems broken

2013-01-07 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546454#comment-13546454 ] Tejas Patil commented on NUTCH-1494: Hi Lewis, I could not run Nutch with rome 0

[jira] [Updated] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2013-01-07 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1284: Patch Info: Patch Available > Add site fetcher.max.crawl.delay as log output by

[jira] [Updated] (NUTCH-1509) Implement read/write in NutchField

2013-01-07 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1509: Patch Info: Patch Available > Implement read/write in NutchField >

[jira] [Updated] (NUTCH-1507) Remove FetcherOutput

2013-01-07 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1507: Patch Info: Patch Available > Remove FetcherOutput > > >

[jira] [Commented] (NUTCH-1494) RSS feed plugin seems broken

2013-01-07 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546459#comment-13546459 ] Lewis John McGibbney commented on NUTCH-1494: - Hi Tejas, try here http://searc

[jira] [Updated] (NUTCH-1506) Add UPDATE action to NutchIndexAction

2013-01-07 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1506: Patch Info: Patch Available > Add UPDATE action to NutchIndexAction >

[jira] [Commented] (NUTCH-1505) java.lang.IllegalArgumentException during updatedb

2013-01-07 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546462#comment-13546462 ] Lewis John McGibbney commented on NUTCH-1505: - Hi Stanley, can you please tell

[jira] [Comment Edited] (NUTCH-1505) java.lang.IllegalArgumentException during updatedb

2013-01-07 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546462#comment-13546462 ] Lewis John McGibbney edited comment on NUTCH-1505 at 1/8/13 12:48 AM: --

[jira] [Updated] (NUTCH-710) Support for rel="canonical" attribute

2013-01-07 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-710: --- Patch Info: Patch Available Fix Version/s: 2.2 > Support for rel="canonical

[jira] [Updated] (NUTCH-1494) RSS feed plugin seems broken

2013-01-07 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1494: --- Attachment: NUTCH-1494.3.patch @Lewis: it worked :) I have attached the patch. Please let me know you

Re: Failing Nightly Builds

2013-01-07 Thread Tejas Patil
Hi Lewis, These test cases pass on my machine (I guess on yours too). Had it been related to the Hadoop API, the tests would fail everywhere. What is different about the setup where the nightly builds are executed? Thanks, Tejas Patil On Mon, Jan 7, 2013 at 3:24 PM, Lewis John Mcgibbney < lewis.mc

[jira] [Commented] (NUTCH-840) Port tests from parse-html to parse-tika

2013-01-07 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546564#comment-13546564 ] Lewis John McGibbney commented on NUTCH-840: Hi Julien. I cleaned my local ivy

[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2013-01-07 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-840: --- Patch Info: Patch Available > Port tests from parse-html to parse-tika > -

[jira] [Commented] (NUTCH-1119) JUnit test for index-static

2013-01-07 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546569#comment-13546569 ] Lewis John McGibbney commented on NUTCH-1119: - Hi Tejas, I've tested reviewed

[jira] [Resolved] (NUTCH-1119) JUnit test for index-static

2013-01-07 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1119. - Resolution: Fixed Committed @revision 1430129 in trunk Thank you Tejas for your c

[jira] [Updated] (NUTCH-1224) Migrate FreeGenerator to MapReduce API

2013-01-07 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1224: Patch Info: Patch Available > Migrate FreeGenerator to MapReduce API >

Re: Failing Nightly Builds

2013-01-07 Thread Lewis John Mcgibbney
Hi Tejas, Jenkins seems to have had a reboot (or something of this nature) around Christmas. I need to be honest and say that I don't know the source of the problem. That said, Hadoop (and other technologies) can also be a funny bugger sometimes when it comes to security, proxy, inet address

[jira] [Commented] (NUTCH-1127) JUnit test for urlfilter-validator

2013-01-07 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546580#comment-13546580 ] Lewis John McGibbney commented on NUTCH-1127: - Hi Tejas I've looked at the tes

[jira] [Resolved] (NUTCH-1127) JUnit test for urlfilter-validator

2013-01-07 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1127. - Resolution: Fixed Committed @revision 1430135 in trunk Thank you Tejas for the pa

Build failed in Jenkins: Nutch-nutchgora #457

2013-01-07 Thread Apache Jenkins Server
See -- [...truncated 2807 lines...] copy-generated-lib: [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/plugins/urlfilter-suffix ini

[jira] [Commented] (NUTCH-1119) JUnit test for index-static

2013-01-07 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546595#comment-13546595 ] Hudson commented on NUTCH-1119: --- Integrated in Nutch-trunk #2072 (See [https://builds.apach

[jira] [Commented] (NUTCH-1127) JUnit test for urlfilter-validator

2013-01-07 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546596#comment-13546596 ] Hudson commented on NUTCH-1127: --- Integrated in Nutch-trunk #2072 (See [https://builds.apach

Build failed in Jenkins: Nutch-trunk #2072

2013-01-07 Thread Apache Jenkins Server
See Changes: [lewismc] NUTCH-1127 JUnit test for urlfilter-validator [lewismc] NUTCH-1119 JUnit test for index-static -- [...truncated 3464 lines...] copy-generated-lib: [copy] Copying 1 file

[jira] [Commented] (NUTCH-1119) JUnit test for index-static

2013-01-07 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546606#comment-13546606 ] Hudson commented on NUTCH-1119: --- Integrated in nutch-trunk-maven #553 (See [https://builds.

[jira] [Commented] (NUTCH-1127) JUnit test for urlfilter-validator

2013-01-07 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546607#comment-13546607 ] Hudson commented on NUTCH-1127: --- Integrated in nutch-trunk-maven #553 (See [https://builds.

Re: Failing Nightly Builds

2013-01-07 Thread Tejas Patil
Hi Lewis, I feel that this issue might be related to the "/etc/hosts" file. In [0], Dennis Kubes suggested a change to the hosts file (the same thing was mentioned in article [1]). In [2], they suggested checking whether ssh works using both hostname and IP. [0] : http://lucene.472066.n3.nabble.com/Nutch-Crawl

[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-07 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546639#comment-13546639 ] Tejas Patil commented on NUTCH-1031: The current Nutch robots parsing logic uses th