I'm using Nutch 2 with Cassandra, seeded with DMOZ URLs (a subset of 5000).
After every successful parse, I get this exception:
2013-10-10 14:52:08,517 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-10-10 14:52:08,949 INFO parse.ParserJob - ParserJob: success
2013-10-10 14:52:09,988 INFO crawl.DbUpdaterJob - DbUpdaterJob: starting
2013-10-10 14:52:10,905 INFO connection.CassandraHostRetryService - Downed Host Retry service started with queue size -1 and retry delay 10s
2013-10-10 14:52:10,997 INFO service.JmxMonitor - Registering JMX me.prettyprint.cassandra.service_Test Cluster:ServiceType=hector,MonitorType=hector
2013-10-10 14:52:11,189 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-10-10 14:52:11,523 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2013-10-10 14:59:00,217 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-10-10 14:59:00,220 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-10-10 14:59:00,220 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2013-10-10 14:59:00,220 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2013-10-10 14:59:01,624 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 10000 records
2013-10-10 14:59:10,058 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 20000 records
2013-10-10 14:59:20,642 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 30000 records
2013-10-10 14:59:29,553 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 40000 records
2013-10-10 14:59:35,664 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 50000 records
2013-10-10 14:59:45,496 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 60000 records
2013-10-10 14:59:56,833 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 70000 records
2013-10-10 15:00:03,195 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 80000 records
2013-10-10 15:00:07,586 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 90000 records
2013-10-10 15:00:15,199 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 100000 records
2013-10-10 15:00:19,123 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 110000 records
2013-10-10 15:00:23,250 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 120000 records
2013-10-10 15:00:37,575 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 130000 records
2013-10-10 15:00:59,263 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 140000 records
2013-10-10 15:01:17,698 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 150000 records
2013-10-10 15:01:28,902 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 160000 records
2013-10-10 15:01:40,918 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 170000 records
2013-10-10 15:01:49,767 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 180000 records
2013-10-10 15:02:01,830 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 190000 records
2013-10-10 15:02:12,153 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 200000 records
2013-10-10 15:02:21,576 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 210000 records
2013-10-10 15:02:31,489 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 220000 records
2013-10-10 15:02:40,369 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 230000 records
2013-10-10 15:02:51,893 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-10-10 15:02:51,894 WARN mapred.LocalJobRunner - job_local599368945_0001
java.lang.IndexOutOfBoundsException
    at java.nio.Buffer.checkBounds(Buffer.java:559)
    at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:143)
    at org.apache.avro.ipc.ByteBufferInputStream.read(ByteBufferInputStream.java:52)
    at org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:183)
    at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:265)
    at org.apache.gora.mapreduce.FakeResolvingDecoder.readString(FakeResolvingDecoder.java:131)
    at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:280)
    at org.apache.avro.generic.GenericDatumReader.readMap(GenericDatumReader.java:191)
    at org.apache.gora.avro.PersistentDatumReader.readMap(PersistentDatumReader.java:183)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:83)
    at org.apache.gora.avro.PersistentDatumReader.readRecord(PersistentDatumReader.java:139)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:80)
    at org.apache.gora.avro.PersistentDatumReader.read(PersistentDatumReader.java:103)
    at org.apache.gora.avro.PersistentDatumReader.read(PersistentDatumReader.java:98)
    at org.apache.gora.mapreduce.PersistentDeserializer.deserialize(PersistentDeserializer.java:73)
    at org.apache.gora.mapreduce.PersistentDeserializer.deserialize(PersistentDeserializer.java:36)
    at org.apache.gora.util.IOUtils.deserialize(IOUtils.java:217)
    at org.apache.nutch.util.WebPageWritable.readFields(WebPageWritable.java:45)
    at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:117)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
The stack trace shows the failure happens while the DbUpdaterJob's reducer
deserializes a WebPage record, so my hunch is that it could be something in
Gora's configuration settings. Has anybody encountered this issue?
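In case it helps pin things down, the only Gora knobs visible in the log above
are gora.buffer.read.limit and gora.buffer.write.limit (both 10000). One
experiment I'm considering is overriding them in conf/nutch-site.xml; this is
only a sketch of that override (the property names come from the
GoraRecordReader/GoraRecordWriter log lines, and the smaller values are a
guess, not a confirmed fix):

```xml
<!-- conf/nutch-site.xml (fragment): override the Gora record buffer sizes
     reported in the log. Property names are taken from the log output above;
     the values here are experimental, not a confirmed fix. -->
<property>
  <name>gora.buffer.read.limit</name>
  <value>1000</value>
</property>
<property>
  <name>gora.buffer.write.limit</name>
  <value>1000</value>
</property>
```

If the exception still appears at the same record count after changing these,
that would suggest the buffer sizes are not the culprit.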
Thanks in advance,
--
Sent from the Nutch - User mailing list archive at Nabble.com.