Re: Nutch ignores robots.txt

2011-11-16 Thread Mathijs Homminga
Hi Lewis, I believe that you can find the robots.txt of the site here: http://www.kinoundco.de/robots.txt I think he followed the instructions at http://lucene.apache.org/nutch/bot.html (this outdated URL is still in the HttpBase.java btw) correctly. My guess is that the guys at pixray.com have

[jira] [Commented] (NUTCH-1196) Update job should impose an upper limit on the number of inlinks (nutchgora)

2011-11-16 Thread Hudson (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151828#comment-13151828 ] Hudson commented on NUTCH-1196: --- Integrated in Nutch-nutchgora #70 (See [https://builds.apa

Re: Nutch ignores robots.txt

2011-11-16 Thread Lewis John Mcgibbney
Hi Maximilian, What we were missing is the robots.txt itself, i.e. how are you trying to ban Nutch? I've been in touch with the guys at traffic server about your issue to see if they have suggestions without totally banning all Nutch instances from contacting your webserver. To all devs, the othe

Re: [VOTE] Apache Nutch 1.4 release rc #1

2011-11-16 Thread Markus Jelsma
> Thanks for the FYI guys. > > I've got this on my open source radar, along with > reviewing the Airavata release (incubating), and > the MRUnit release (incubating) for this week. > > I'll git er' done. Also, since the release updates for rc #2 > were largely aesthetic (aka packaging and naming

Re: [VOTE] Apache Nutch 1.4 release rc #1

2011-11-16 Thread Mattmann, Chris A (388J)
Thanks for the FYI guys. I've got this on my open source radar, along with reviewing the Airavata release (incubating), and the MRUnit release (incubating) for this week. I'll git er' done. Also, since the release updates for rc #2 were largely aesthetic (aka packaging and naming of the outp

Re: [VOTE] Apache Nutch 1.4 release rc #1

2011-11-16 Thread Markus Jelsma
> Chris, > > Any idea of when you'll be able to push a new RC for 1.4? > Note : I think some stuff marked as 1.5 has been committed - we might need > to check the CHANGES Definitely, I've committed several items. When I did my first trunk was already prepared for 1.5. Here's the list of change

Re: [VOTE] Apache Nutch 1.4 release rc #1

2011-11-16 Thread Julien Nioche
Chris, Any idea of when you'll be able to push a new RC for 1.4? Note : I think some stuff marked as 1.5 has been committed - we might need to check the CHANGES Thanks Julien On 9 November 2011 10:21, Mattmann, Chris A (388J) < chris.a.mattm...@jpl.nasa.gov> wrote: > Hi Julien, > > Thanks. OK,

[jira] [Reopened] (NUTCH-1081) ant tests fail

2011-11-16 Thread Lewis John McGibbney (Reopened) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reopened NUTCH-1081: - Reopening this issue as per our concerns. For the record, the Jenkins build area ha

[jira] [Closed] (NUTCH-1196) Update job should impose an upper limit on the number of inlinks (nutchgora)

2011-11-16 Thread Ferdy Galema (Closed) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1196. --- Resolution: Fixed Committed. > Update job should impose an upper limit on the number

WG: Nutch ignores robots.txt

2011-11-16 Thread Maximilian Laurenz
All requests seem to come from a German company called http://www.pixray.com, which obviously ignores the robots.txt with their version of the Nutch crawler. We informed them and will ban their IP range if they don't stop scanning us with invalid requests. Sincerely, Maximilian Laurenz S&L Medi

[jira] [Closed] (NUTCH-1148) Nutchgora job jar functionalilty is broken: PluginManifestParser cannot load plugins from system classloader.

2011-11-16 Thread Ferdy Galema (Closed) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1148. --- Resolution: Cannot Reproduce Wow this really boggles my mind: I tried to do a final check with and wi

[jira] [Updated] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks

2011-11-16 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1184: - Attachment: NUTCH-1185-1.5-9.patch > Fetcher to parse and follow Nth degree outlinks > --

[jira] [Updated] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks

2011-11-16 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1184: - Attachment: (was: NUTCH-1185-1.5-9.patch) > Fetcher to parse and follow Nth degree outlin

[jira] [Updated] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks

2011-11-16 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1184: - Attachment: NUTCH-1185-1.5-9.patch New patch [9] solves an issue of NPE in filtering. It's now re

Re: svn commit: r1202387 - /nutch/branches/nutchgora/build.xml

2011-11-16 Thread Ferdy Galema
Hi Lewis, Please note that although most of the formatting has been reverted, the indent style is still not as usual. (You converted spaces to tabs.) When using the default Eclipse XML editor, you can easily overcome this by setting the preference "Indent using spaces" in XML --> XML Files --> E