I'm using Nutch 2 with Cassandra, seeded with DMOZ URLs (a subset of 5000).
After every successful parse, I get this exception:
2013-10-10 14:52:08,517 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-10-10 14:52:08,949 INFO parse.ParserJob - ParserJob: success
2013-10-10 14:52:09,988 INFO crawl.DbUpdaterJob - DbUpdaterJob: starting
2013-10-10 14:52:10,905 INFO connection.CassandraHostRetryService - Downed Host Retry service started with queue size -1 and retry delay 10s
2013-10-10 14:52:10,997 INFO service.JmxMonitor - Registering JMX me.prettyprint.cassandra.service_Test Cluster:ServiceType=hector,MonitorType=hector
2013-10-10 14:52:11,189 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-10-10 14:52:11,523 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2013-10-10 14:59:00,217 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-10-10 14:59:00,220 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-10-10 14:59:00,220 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2013-10-10 14:59:00,220 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2013-10-10 14:59:01,624 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 10000 records
2013-10-10 14:59:10,058 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 20000 records
2013-10-10 14:59:20,642 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 30000 records
2013-10-10 14:59:29,553 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 40000 records
2013-10-10 14:59:35,664 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 50000 records
2013-10-10 14:59:45,496 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 60000 records
2013-10-10 14:59:56,833 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 70000 records
2013-10-10 15:00:03,195 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 80000 records
2013-10-10 15:00:07,586 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 90000 records
2013-10-10 15:00:15,199 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 100000 records
2013-10-10 15:00:19,123 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 110000 records
2013-10-10 15:00:23,250 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 120000 records
2013-10-10 15:00:37,575 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 130000 records
2013-10-10 15:00:59,263 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 140000 records
2013-10-10 15:01:17,698 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 150000 records
2013-10-10 15:01:28,902 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 160000 records
2013-10-10 15:01:40,918 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 170000 records
2013-10-10 15:01:49,767 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 180000 records
2013-10-10 15:02:01,830 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 190000 records
2013-10-10 15:02:12,153 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 200000 records
2013-10-10 15:02:21,576 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 210000 records
2013-10-10 15:02:31,489 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 220000 records
2013-10-10 15:02:40,369 INFO mapreduce.GoraRecordWriter - Flushing the datastore after 230000 records
2013-10-10 15:02:51,893 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-10-10 15:02:51,894 WARN mapred.LocalJobRunner - job_local599368945_0001
java.lang.IndexOutOfBoundsException
    at java.nio.Buffer.checkBounds(Buffer.java:559)
    at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:143)
    at org.apache.avro.ipc.ByteBufferInputStream.read(ByteBufferInputStream.java:52)
    at org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:183)
    at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:265)
    at org.apache.gora.mapreduce.FakeResolvingDecoder.readString(FakeResolvingDecoder.java:131)
    at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:280)
    at org.apache.avro.generic.GenericDatumReader.readMap(GenericDatumReader.java:191)
    at org.apache.gora.avro.PersistentDatumReader.readMap(PersistentDatumReader.java:183)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:83)
    at org.apache.gora.avro.PersistentDatumReader.readRecord(PersistentDatumReader.java:139)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:80)
    at org.apache.gora.avro.PersistentDatumReader.read(PersistentDatumReader.java:103)
    at org.apache.gora.avro.PersistentDatumReader.read(PersistentDatumReader.java:98)
    at org.apache.gora.mapreduce.PersistentDeserializer.deserialize(PersistentDeserializer.java:73)
    at org.apache.gora.mapreduce.PersistentDeserializer.deserialize(PersistentDeserializer.java:36)
    at org.apache.gora.util.IOUtils.deserialize(IOUtils.java:217)
    at org.apache.nutch.util.WebPageWritable.readFields(WebPageWritable.java:45)
    at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:117)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
The stack trace shows the failure happens while the DbUpdaterJob's reducer
deserializes a WebPage record, so my hunch is that it could be something in
Gora's configuration settings. Has anybody encountered this issue?
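In case it helps pin things down, the only Gora knobs visible in the log above
are gora.buffer.read.limit and gora.buffer.write.limit (both 10000). One
experiment I'm considering is overriding them in conf/nutch-site.xml; this is
only a sketch of that override (the property names come from the
GoraRecordReader/GoraRecordWriter log lines, and the smaller values are a
guess, not a confirmed fix):

```xml
<!-- conf/nutch-site.xml (fragment): override the Gora record buffer sizes
     reported in the log. Property names are taken from the log output above;
     the values here are experimental, not a confirmed fix. -->
<property>
  <name>gora.buffer.read.limit</name>
  <value>1000</value>
</property>
<property>
  <name>gora.buffer.write.limit</name>
  <value>1000</value>
</property>
```

If the exception still appears at the same record count after changing these,
that would suggest the buffer sizes are not the culprit.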
Thanks in advance,
--
Sent from the Nutch - User mailing list archive at Nabble.com.