[ https://issues.apache.org/jira/browse/NUTCH-899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexis updated NUTCH-899:
-------------------------
    Attachment: httpContentLimit.patch

We stick with the default Gora schema for the MySQL backend, which declares the content field as "bytes" in the Avro definition; that is translated into "blob" in MySQL.

From src/gora/webpage.avsc:

    {"name": "WebPage",
     "type": "record",
     "namespace": "org.apache.nutch.storage",
     "fields": [
        {"name": "content", "type": "bytes"},
     ]
    }

There is a potential bug in protocol-http. The fetched content might slightly exceed the http.content.limit value, hence the error saying that the value is too big for the MySQL blob column type, even though we explicitly set http.content.limit to the maximum blob size of 65535.

I tried to come up with a unit test for this, which is rather imperfect. Please see it in the attached patch. It changes http.content.limit from 65536 to 65535 when fetching a URL whose body content is big enough. The first test should see the error, the second should not.

Ideally we want to generate the content with a local server for the unit test instead of using a random internet URL. That remains to be implemented in the test.

> java.sql.BatchUpdateException: Data truncation: Data too long for column 'content' at row 1
> -------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-899
>                 URL: https://issues.apache.org/jira/browse/NUTCH-899
>             Project: Nutch
>          Issue Type: Bug
>          Components: storage
>    Affects Versions: 2.0
>         Environment: ubuntu 10.04
> JVM : 1.6.0_20
> nutch 2.0 (trunk)
> Mysql/HBase (0.20.6) / Hadoop(0.20.2) pseudo-distributed
>            Reporter: Faruk Berksöz
>            Priority: Minor
>         Attachments: httpContentLimit.patch
>
>
> When I try to fetch a web page (e.g. http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html) with the MySQL storage definition,
> I see the following error in my Hadoop logs (there is no error with HBase):
> java.io.IOException: java.sql.BatchUpdateException: Data truncation: Data too long for column 'content' at row 1
>     at org.gora.sql.store.SqlStore.flush(SqlStore.java:316)
>     at org.gora.sql.store.SqlStore.close(SqlStore.java:163)
>     at org.gora.mapreduce.GoraOutputFormat$1.close(GoraOutputFormat.java:72)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> The type of the column 'content' is BLOB.
> It may be important for the next developments of Gora.
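
For illustration, the size arithmetic behind the error can be sketched as follows. A MySQL BLOB column holds at most 65535 bytes (2^16 - 1), so content of 65536 bytes, or content that overshoots a 65535 limit by a single byte, no longer fits. The class and method names below are hypothetical and are not part of Nutch or Gora. This is only a minimal sketch of a defensive truncation that keeps the stored byte array within the limit, not the fix in the attached patch. Treating a negative limit as "no truncation" is an assumption of this sketch.

```java
import java.util.Arrays;

public class ContentLimitSketch {
    // Maximum length of a MySQL BLOB column in bytes (2^16 - 1).
    public static final int MYSQL_BLOB_MAX_BYTES = 65535;

    // Hypothetical helper: returns the content unchanged when it is within
    // the limit (or when the limit is negative, treated here as "no limit"),
    // otherwise a copy cut down to exactly `limit` bytes.
    public static byte[] truncateToLimit(byte[] content, int limit) {
        if (limit < 0 || content.length <= limit) {
            return content;
        }
        return Arrays.copyOf(content, limit);
    }

    // True when the content would fit into a BLOB column.
    public static boolean fitsInBlob(byte[] content) {
        return content.length <= MYSQL_BLOB_MAX_BYTES;
    }

    public static void main(String[] args) {
        byte[] body = new byte[65536]; // one byte over the BLOB maximum
        System.out.println(fitsInBlob(body));
        System.out.println(fitsInBlob(truncateToLimit(body, MYSQL_BLOB_MAX_BYTES)));
    }
}
```

A check of this kind just before the content is stored would guard against any fetcher that overshoots the configured limit, at the cost of silently dropping the trailing bytes.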