[ https://issues.apache.org/jira/browse/NUTCH-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13893718#comment-13893718 ]
Lewis John McGibbney edited comment on NUTCH-1723 at 2/6/14 7:52 PM:
---------------------------------------------------------------------

Hi [~ksmets], this is a good catch but a pretty nasty one to deal with. We are currently working on a GORA_94 branch, which upgrades Avro from 1.3.3 to 1.7.X and introduces a new persistency API. I am trying to focus my time on more pressing issues such as that upgrade, so I'm personally not going to try to get a fix in right now. If this issue is still present when we release Gora 0.3 then I'll look into it in detail.
Thanks for logging this bug... it is a PITA indeed!
Are you able to resume crawls, or does the job task fail entirely?

> nutch updatedb fails due to avro (de)serialization issues on images
> -------------------------------------------------------------------
>
>                 Key: NUTCH-1723
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1723
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb, parser
>    Affects Versions: 2.3, 2.2.1
>         Environment: - Ubuntu 12.04.3 LTS (GNU/Linux 3.2.0-36-generic x86_64)
>                      - DataStax Community Edition Apache Cassandra 2.0.4
>            Reporter: Koen Smets
>              Labels: avro, cassandra, gora, gora-cassandra, nutch, tika
>             Fix For: 2.3
>
>
> Running `bin/crawl` for 2 iterations, using either the nutch-2.2.1 release or
> the latest 2.x checkout, on a seed file containing for example
> http://www.mountsinai.on.ca and http://www.dhzb.de (or any other webpage with
> image files that have no obvious file extension) causes the job to throw
> java.lang.IllegalArgumentException, IOException and/or
> IndexOutOfBoundsException in the readFields function of WebPageWritable:
>   @Override
>   public void readFields(DataInput in) throws IOException {
>     webPage = IOUtils.deserialize(getConf(), in, webPage, WebPage.class);
>   }
>   @Override
>   public void write(DataOutput out) throws IOException {
>     IOUtils.serialize(getConf(), out, webPage, WebPage.class);
>   }
> 2014-02-04 13:50:15,421 INFO util.WebPageWritable - Try reading fields: ...
> 2014-02-04 13:50:15,423 ERROR util.WebPageWritable - Error - Failed to read fields: http://www.mountsinai.on.ca/carousel/patient-care-banner/image
> 2014-02-04 13:50:15,423 ERROR util.WebPageWritable - Error - Reading fields of the WebPage class failed - java.lang.IllegalArgumentException
> 2014-02-04 13:50:15,425 ERROR util.WebPageWritable - Error - Printing stacktrace - java.lang.IllegalArgumentException
> Or,
> java.lang.IndexOutOfBoundsException
>     at java.nio.Buffer.checkBounds(Buffer.java:559)
>     at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:143)
>     at org.apache.avro.ipc.ByteBufferInputStream.read(ByteBufferInputStream.java:52)
>     at org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:183)
>     at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:265)
>     at org.apache.gora.mapreduce.FakeResolvingDecoder.readString(FakeResolvingDecoder.java:131)
>     at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:280)
>     at org.apache.avro.generic.GenericDatumReader.readMap(GenericDatumReader.java:191)
>     at org.apache.gora.avro.PersistentDatumReader.readMap(PersistentDatumReader.java:183)
>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:83)
>     at org.apache.gora.avro.PersistentDatumReader.readRecord(PersistentDatumReader.java:139)
>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:80)
>     at org.apache.gora.avro.PersistentDatumReader.read(PersistentDatumReader.java:103)
>     at org.apache.gora.avro.PersistentDatumReader.read(PersistentDatumReader.java:98)
>     at org.apache.gora.mapreduce.PersistentDeserializer.deserialize(PersistentDeserializer.java:73)
>     at org.apache.gora.mapreduce.PersistentDeserializer.deserialize(PersistentDeserializer.java:36)
>     at org.apache.gora.util.IOUtils.deserialize(IOUtils.java:205)
>     at org.apache.nutch.util.WebPageWritable.readFields(WebPageWritable.java:45)
>     at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>     at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:117)
>     at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
> The exceptions are caused by image files that sneak through the urlfilter (no
> extension indicating an image file) and that get (properly?) parsed by the
> Tika library.
> Note that silently catching the thrown exceptions causes corruption of the
> Cassandra database, as the deserializer reads over multiple webpage entries
> in the DataInput. This results in the loss of several pages from other hosts
> present in the seed file.
> Moreover, if one makes sure that the image pages don't end up in the
> DataInput written by DBUpdateMapper, e.g. by configuring nutch-site.xml to
> disable the tika parser, the nutch dbupdate finishes properly.
> <property>
>   <name>plugin.excludes</name>
>   <value>parse-tika</value>
> </property>
> I strongly suspect that the issues are due to gora's dependency on the
> outdated avro-1.3.3 library.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
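
The corruption described in the issue, where a failed read leaves the deserializer partway through one entry so that it then reads over the following ones, can be illustrated with a small, self-contained Java sketch. It is not Nutch or Gora code and it does not reproduce the Avro problem itself: FramingSketch, record, decode, writeFramed and readFramed are hypothetical names, and length-prefixed framing is just one common way to keep a bad record from affecting its neighbours, assumed here purely for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; none of these names exist in Nutch or Gora. */
final class FramingSketch {

    /** Builds a toy payload: one length byte followed by UTF-8 text (text must be < 256 bytes). */
    public static byte[] record(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        byte[] payload = new byte[bytes.length + 1];
        payload[0] = (byte) bytes.length;
        System.arraycopy(bytes, 0, payload, 1, bytes.length);
        return payload;
    }

    /** Toy decoder standing in for a real deserializer; rejects an inconsistent length byte. */
    public static String decode(byte[] payload) {
        int declared = payload[0] & 0xFF;
        if (declared > payload.length - 1) {
            throw new IllegalArgumentException("declared length " + declared + " exceeds payload");
        }
        return new String(payload, 1, declared, StandardCharsets.UTF_8);
    }

    /** Writes every payload as a 4-byte length prefix followed by the raw bytes. */
    public static byte[] writeFramed(List<byte[]> payloads) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        for (byte[] payload : payloads) {
            out.writeInt(payload.length);
            out.write(payload);
        }
        out.flush();
        return buffer.toByteArray();
    }

    /**
     * Reads the frames back. Each frame is consumed in full before decoding,
     * so a decode failure cannot shift the position of the next record.
     */
    public static List<String> readFramed(byte[] stream) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(stream));
        List<String> result = new ArrayList<>();
        while (in.available() > 0) {
            byte[] payload = new byte[in.readInt()];
            in.readFully(payload);
            try {
                result.add(decode(payload));
            } catch (RuntimeException e) {
                // The bad record is skipped; the stream is already at the next frame.
                result.add("<skipped: " + e.getClass().getSimpleName() + ">");
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // The middle payload claims 200 bytes of text but carries only one.
        List<byte[]> payloads = Arrays.asList(
                record("page-a"), new byte[] {(byte) 200, 'x'}, record("page-b"));
        System.out.println(readFramed(writeFramed(payloads)));
    }
}
```

Without the length prefix, a decoder that trusted the bogus length byte would keep consuming bytes belonging to the next record, which matches the symptom reported above when the exceptions are caught silently.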