>> I've spent some time working on this as well. I've just put together a
>> blog entry addressing the issues I ran into. See
>> http://techvineyard.blogspot.com/2010/12/build-nutch-20.html
>
> This is a great howto for Nutch 2.0. Feel free to link to it from the Wiki,
> this could be useful to others.
A link has been added on the Nutch wiki frontpage, in the Nutch 2.0 section. Thanks! I also added a small paragraph to the blog that shows how to run a Nutch unit test from Eclipse.

> I don't remember seeing any of the issues you mentioned in the Nutch JIRA.
> If you think something is a bug, why not report it? The same applies to
> the fixes you suggested for GORA.

I've created a new issue in the Gora section of JIRA: https://issues.apache.org/jira/browse/GORA-20

>> In a nutshell, I changed three pieces in the Gora and Nutch code:
>> - flush the datastore regularly in the Hadoop RecordWriter (in
>>   GoraOutputFormat)
>> - wait for Hadoop job completion in the Fetcher job
>> - ensure that the content length limit is not exceeded in the
>>   protocol-http plugin (only for the MySQL datastore)
>
> The content length limit issue can also be fixed by modifying the Gora
> schema for the MySQL backend. It would make sense to allow larger values by
> default. Could you please open a JIRA for this?

I commented on https://issues.apache.org/jira/browse/NUTCH-899, which describes the same problem. I tried to come up with a JUnit test, but it is still rather imperfect (I want to use org.apache.nutch.util.CrawTestUtil.getServer for it). The whole patch is here:
https://issues.apache.org/jira/secure/attachment/12466548/httpContentLimit.patch

Alexis
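PS: for anyone curious about the first change above, the idea is just to flush the datastore every N records from the RecordWriter instead of only on close(). Here is a minimal standalone sketch of that pattern; the class and its fields are illustrative, not the actual Gora/Hadoop API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the periodic-flush idea used in
// GoraOutputFormat's RecordWriter. FlushingWriter and its members
// are hypothetical names, not the real Gora classes.
public class FlushingWriter<T> {
    // Flush after this many records so the backing store is written
    // regularly instead of only when the writer is closed.
    private static final int FLUSH_INTERVAL = 100;

    private final List<T> buffer = new ArrayList<>();
    private final List<T> store = new ArrayList<>(); // stands in for the datastore
    private int written = 0;

    public void write(T record) {
        buffer.add(record);
        written++;
        if (written % FLUSH_INTERVAL == 0) {
            flush();
        }
    }

    public void flush() {
        // Persist everything buffered so far.
        store.addAll(buffer);
        buffer.clear();
    }

    public void close() {
        flush(); // make sure the tail of the buffer is persisted too
    }

    public int persistedCount() {
        return store.size();
    }
}
```

Without the periodic flush, a long-running job would keep everything buffered until close(), which is what caused records to show up in the store only at the very end.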
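PPS: the content-limit change in protocol-http boils down to truncating the response body once the configured limit is reached while reading, so the stored content can never grow past what the MySQL column can hold. A rough self-contained sketch of that read loop (the class and method names are mine, not the plugin's actual code):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical sketch of enforcing a content length limit while
// reading an HTTP response body, similar in spirit to the
// protocol-http fix: stop buffering once the limit is reached.
public class BoundedReader {
    public static byte[] readAtMost(InputStream in, int limit) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                int remaining = limit - out.size();
                if (n >= remaining) {
                    // Keep only what fits under the limit, then stop reading.
                    out.write(chunk, 0, remaining);
                    break;
                }
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

As noted in the thread, the alternative is to widen the column in the Gora schema for the MySQL backend; the two fixes are complementary.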