>> I've spent some time working on this as well. I've just put together a
>> blog entry addressing the issues I ran into. See
>> http://techvineyard.blogspot.com/2010/12/build-nutch-20.html
>>
>
> This is a great howto for Nutch 2.0. Feel free to link to it from the Wiki;
> it could be useful to others.

A link has been added on the Nutch wiki front page, in the Nutch 2.0
section. Thanks! I have also added a small paragraph to the blog
showing how to run a Nutch unit test from Eclipse.

> I don't remember seeing any of the issues you mentioned in the Nutch JIRA.
> If you think something is a bug, why not report it? The same applies to
> the fixes you suggested for GORA.

I've created a new issue in the Gora section of JIRA:
https://issues.apache.org/jira/browse/GORA-20


>
>>
>> In a nutshell, I changed three pieces of the Gora and Nutch code:
>> - flush the datastore regularly in the Hadoop RecordWriter (in
>> GoraOutputFormat)
>> - wait for Hadoop job completion in the Fetcher job
>> - ensure that the content length limit is not being exceeded in
>> protocol-http plugin (only for MySQL datastore)
>>
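
To be a bit more concrete about the first change, here is roughly what
a RecordWriter that flushes the datastore at regular intervals looks
like (a simplified sketch, not the exact patch attached to GORA-20; the
class name, the flush interval and the org.apache.gora package names
are only illustrative):

import java.io.IOException;

import org.apache.gora.persistency.Persistent;
import org.apache.gora.store.DataStore;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Illustrative writer: put() every record into the Gora DataStore and
// flush() it every FLUSH_INTERVAL records so buffered writes reach the
// backend regularly.
public class FlushingGoraRecordWriter<K, T extends Persistent>
    extends RecordWriter<K, T> {

  private static final int FLUSH_INTERVAL = 10000; // arbitrary value

  private final DataStore<K, T> store;
  private long written = 0;

  public FlushingGoraRecordWriter(DataStore<K, T> store) {
    this.store = store;
  }

  @Override
  public void write(K key, T value) throws IOException {
    store.put(key, value);
    if (++written % FLUSH_INTERVAL == 0) {
      store.flush(); // push buffered records to the backend
    }
  }

  @Override
  public void close(TaskAttemptContext context) throws IOException {
    store.close(); // closing also flushes the remaining records
  }
}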
>
> The content length limit issue can also be fixed by modifying the Gora
> schema for the MySQL backend. It would make sense to allow larger values by
> default. Could you please open a JIRA for this?

I commented on https://issues.apache.org/jira/browse/NUTCH-899, which
is about the same problem. I tried to come up with a JUnit test, but it
is still rather imperfect (I want to use
org.apache.nutch.util.CrawlTestUtil.getServer for it). The whole patch
is here:
https://issues.apache.org/jira/secure/attachment/12466548/httpContentLimit.patch
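
The gist of it is to make sure no more than http.content.limit bytes of
the response body are kept. Something along these lines (a standalone
simplified sketch, not the attached patch; the class and method names
are made up):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustrative helper: read a response body but keep at most
// contentLimit bytes (a negative limit meaning "no limit"), which is
// what enforcing http.content.limit in the protocol plugin boils down to.
public class ContentLimitReader {

  public static byte[] readBody(InputStream in, int contentLimit)
      throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    int total = 0;
    int n;
    while ((n = in.read(buffer)) != -1) {
      if (contentLimit >= 0 && total + n > contentLimit) {
        out.write(buffer, 0, contentLimit - total); // keep only what fits
        break;                                      // and stop reading
      }
      out.write(buffer, 0, n);
      total += n;
    }
    return out.toByteArray();
  }
}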

Alexis
