Re: Is Nutch 2.0 in good enough shape to test?

2010-12-18 Thread Alexis
> I've spent some time working on this as well. I've just put together a
>> blog entry addressing the issues I ran into. See
>> http://techvineyard.blogspot.com/2010/12/build-nutch-20.html
>>
>
> This is a great howto for Nutch 2.0. Feel free to link to it from the Wiki,
> this could be useful to others.

A link has been added on the Nutch wiki front page, in the Nutch 2.0 section. Thanks!
I also added a small paragraph to the blog that shows how to run a Nutch
unit test from Eclipse.

> I don't remember seeing any of the issues you mentioned in the Nutch JIRA.
> If you think something is a bug, why not report it? The same applies to
> the fixes you suggested for GORA.

I've created a new issue in the Gora section of JIRA:
https://issues.apache.org/jira/browse/GORA-20


>
>>
>> In a nutshell, I changed three pieces of Gora and Nutch code:
>> - flush the datastore regularly in the Hadoop RecordWriter (in
>> GoraOutputFormat)
>> - wait for Hadoop job completion in the Fetcher job
>> - ensure that the content length limit is not being exceeded in
>> protocol-http plugin (only for MySQL datastore)
>>
>
> the content length limit issue can also be fixed by modifying the gora
> schema for the MySQL backend. It would make sense to allow larger values by
> default. Could you please open a JIRA for this?

I commented on https://issues.apache.org/jira/browse/NUTCH-899 which
is the same problem. I tried to come up with a JUnit test but it is
still rather imperfect (I want to use
org.apache.nutch.util.CrawTestUtil.getServer for it). The whole patch
is here:
https://issues.apache.org/jira/secure/attachment/12466548/httpContentLimit.patch
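The first of the three changes quoted above (flushing the datastore regularly from the Hadoop RecordWriter in GoraOutputFormat) can be sketched roughly as follows. This is a self-contained illustration, not Gora's actual API: the `Store` interface, the `PeriodicWriter` class, and the flush threshold are all made-up placeholders for the idea of flushing every N records instead of only at close().

```java
// Hedged sketch: flush the backing store every N records so buffered
// writes do not grow unbounded before close(). Names are illustrative.
public class PeriodicWriter {
    interface Store {
        void put(String key, Object value);
        void flush();
    }

    private final Store store;
    private final int flushEvery;
    private int pending = 0;

    public PeriodicWriter(Store store, int flushEvery) {
        this.store = store;
        this.flushEvery = flushEvery;
    }

    public void write(String key, Object value) {
        store.put(key, value);
        if (++pending >= flushEvery) { // flush before buffers grow too large
            store.flush();
            pending = 0;
        }
    }

    public static void main(String[] args) {
        final int[] flushes = {0};
        Store stub = new Store() {
            public void put(String key, Object value) {}
            public void flush() { flushes[0]++; }
        };
        PeriodicWriter w = new PeriodicWriter(stub, 1000);
        for (int i = 0; i < 2500; i++) {
            w.write("key" + i, i);
        }
        System.out.println("flushes: " + flushes[0]); // prints "flushes: 2"
    }
}
```

The trade-off is the usual one: flushing more often costs throughput but bounds memory use and surfaces backend errors (like the MySQL truncation below) closer to the record that caused them.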

Alexis


[jira] Updated: (NUTCH-899) java.sql.BatchUpdateException: Data truncation: Data too long for column 'content' at row 1

2010-12-18 Thread Alexis (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis updated NUTCH-899:
-

Attachment: httpContentLimit.patch

We stick with the default Gora schema for the MySQL backend, which declares 
"bytes" in the Avro definition; that is translated into "blob" in MySQL. From 
src/gora/webpage.avsc:
{"name": "WebPage",
 "type": "record",
 "namespace": "org.apache.nutch.storage",
 "fields": [
   {"name": "content", "type": "bytes"}
 ]
}
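As suggested in the thread, one way to allow larger values by default would be to widen the column on the MySQL side. This is a hypothetical sketch: it assumes the default table name (webpage) and column name (content), and MEDIUMBLOB (up to 16 MB) is just one possible choice.

```sql
-- Hypothetical schema tweak (default table/column names assumed):
-- MySQL BLOB caps at 65535 bytes; MEDIUMBLOB allows up to 16 MB.
ALTER TABLE webpage MODIFY content MEDIUMBLOB;
```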


There is a potential bug in protocol-http. The http.content.limit value might be 
exceeded slightly, hence the error saying that the value is too big for the 
MySQL blob column type, even though we explicitly force http.content.limit to 
the 65535 maximum size.
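A strict version of that check could look roughly like the following self-contained sketch. This is plain Java, not the actual protocol-http code or the attached patch; the class and method names are made up. The point is that the accumulated body can never exceed the configured limit, even when the last read would overshoot it.

```java
// Hedged sketch: cap accumulated fetch content at a hard byte limit
// (e.g. 65535, the MySQL BLOB maximum). Names are illustrative.
import java.io.ByteArrayOutputStream;

public class ContentLimit {
    /** Appends at most (limit - out.size()) bytes of chunk to out.
     *  Returns false once the limit has been reached. */
    public static boolean append(ByteArrayOutputStream out, byte[] chunk,
                                 int read, int limit) {
        int room = limit - out.size();
        if (room <= 0) return false;          // already full: stop reading
        out.write(chunk, 0, Math.min(read, room)); // never overshoot
        return out.size() < limit;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[1000];
        // Simulate reading up to 70 chunks of 1000 bytes with a 65535 limit.
        boolean more = true;
        for (int i = 0; i < 70 && more; i++) {
            more = append(out, chunk, chunk.length, 65535);
        }
        System.out.println(out.size()); // prints 65535, never more
    }
}
```

Without the Math.min clamp, the final read can push the buffer a few hundred bytes past the limit, which is exactly the kind of slight overshoot that trips the MySQL BLOB column.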

I tried to come up with a unit test for this, which is still rather imperfect. 
Please see it in the attached patch. It changes http.content.limit from 65536 to 
65535 when fetching a URL whose body content is large enough. The first test 
should see the error, the second should not.

Ideally we would generate the content with a local server for the unit test 
instead of fetching an arbitrary internet URL. That remains to be implemented in 
the test.

> java.sql.BatchUpdateException: Data truncation: Data too long for column 
> 'content' at row 1
> ---
>
> Key: NUTCH-899
> URL: https://issues.apache.org/jira/browse/NUTCH-899
> Project: Nutch
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 2.0
> Environment: ubuntu 10.04
> JVM : 1.6.0_20
> nutch 2.0 (trunk)
> Mysql/HBase (0.20.6) / Hadoop(0.20.2) pseudo-distributed 
>Reporter: Faruk Berksöz
>Priority: Minor
> Attachments: httpContentLimit.patch
>
>
> When I try to fetch a web page (e.g. 
> http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html ) with the MySQL 
> storage definition,
> I see the following error in my Hadoop logs (there is no error with HBase):
> java.io.IOException: java.sql.BatchUpdateException: Data truncation: Data too 
> long for column 'content' at row 1
> at org.gora.sql.store.SqlStore.flush(SqlStore.java:316)
> at org.gora.sql.store.SqlStore.close(SqlStore.java:163)
> at org.gora.mapreduce.GoraOutputFormat$1.close(GoraOutputFormat.java:72)
> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> The type of the column 'content' is BLOB.
> It may be important for future developments of Gora.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.