OK, I have tried some other URLs, and apparently I get the DbUpdaterJob
exception on this URL only:
http://www.amazon.com/Cisco-WAP4410N-Wireless-N-Access-Point/dp/B001IYCMNA.
So there is some data on this URL which is causing a problem for my Cassandra
update DB job.
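(For reference: the IndexOutOfBoundsException in the hadoop log below bottoms out in java.nio.Buffer.checkBounds, which fires when a read is asked to copy more bytes than the destination array can hold; that would be consistent with the Avro decoder picking up a corrupt string length from this row's data. A minimal sketch of just that failure mode, using plain java.nio with made-up sizes and no Nutch/Gora classes:)

```java
import java.nio.ByteBuffer;

public class BufferBoundsDemo {

    // Returns true if copying `requested` bytes from a 4-byte buffer into a
    // `destSize`-byte array throws IndexOutOfBoundsException.
    static boolean boundsCheckFails(int destSize, int requested) {
        ByteBuffer buf = ByteBuffer.wrap(new byte[] {1, 2, 3, 4});
        try {
            buf.get(new byte[destSize], 0, requested);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A sane length (2 bytes into a 4-byte array) copies fine.
        System.out.println("requested 2  -> IOOBE: " + boundsCheckFails(4, 2));
        // A corrupt length (16 bytes into a 4-byte array) fails the bounds
        // check in java.nio.Buffer.checkBounds, the top frame of the trace.
        System.out.println("requested 16 -> IOOBE: " + boundsCheckFails(4, 16));
    }
}
```

So if only this one URL triggers it, the serialized data stored for that row, or the length prefix written for one of its string fields, is probably the thing to inspect.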

Any ideas where I should look further to resolve this issue?

Thanks
Tony.


On Mon, Jun 17, 2013 at 11:35 PM, Tony Mullins <[email protected]> wrote:

> I am using the Gora that comes with Nutch 2.x (I think it is 0.3) with
> Cassandra 1.2.5, and I am getting the above-mentioned error.
> Any hints on how I should tackle this problem? Any suggestions, please?
>
> If I do a simple crawl like www.google.com, all works fine!
>
> Thanks,
> Tony.
>
>
> On Mon, Jun 17, 2013 at 11:21 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
>> Hi Tony,
>> Which Gora backend are you on, and which version of the backend itself,
>> please?
>> I use Gora 0.3 with gora-cassandra on some cron jobs and injected your
>> URLs
>> into my db. All works fine.
>> I did notice that these pages have a hellish lot of content which is not
>> displayed on the page: loads of CSS and garbage.
>> Something else I noticed is that if you enable microformats-reltag and
>> parse the reltags out, you get loads of bad characters... which is not nice.
>> I will log a JIRA issue, as this should be fixed.
>>
>>
>> On Mon, Jun 17, 2013 at 10:08 AM, Tony Mullins <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > I am getting a weird error on the DbUpdaterJob in Nutch 2.x.
>> > I am crawling these two links:
>> >
>> > http://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y
>> > http://www.amazon.com/Cisco-WAP4410N-Wireless-N-Access-Point/dp/B001IYCMNA
>> >
>> > And all my jobs are running fine, but when I run my dbupdate job I get
>> > this error:
>> >
>> > Exception in thread "main" java.lang.RuntimeException: job failed:
>> > name=update-table, jobid=job_local482736560_0001
>> >     at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
>> >     at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:98)
>> >     at org.apache.nutch.crawl.DbUpdaterJob.updateTable(DbUpdaterJob.java:105)
>> >     at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:119)
>> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> >     at org.apache.nutch.crawl.DbUpdaterJob.main(DbUpdaterJob.java:123)
>> >
>> > And the hadoop log file says:
>> >
>> > 2013-06-17 21:51:41,478 WARN  mapred.FileOutputCommitter - Output path
>> is
>> > null in cleanup
>> > 2013-06-17 21:51:41,479 WARN  mapred.LocalJobRunner -
>> > job_local384125843_0001
>> > java.lang.IndexOutOfBoundsException
>> >     at java.nio.Buffer.checkBounds(Buffer.java:559)
>> >     at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:143)
>> >     at
>> >
>> >
>> org.apache.avro.ipc.ByteBufferInputStream.read(ByteBufferInputStream.java:52)
>> >     at
>> >
>> >
>> org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:183)
>> >     at
>> org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:265)
>> >     at
>> >
>> >
>> org.apache.gora.mapreduce.FakeResolvingDecoder.readString(FakeResolvingDecoder.java:131)
>> >
>> > And if I crawl a simple page like www.google.nl, everything works fine,
>> > including the dbupdate job!
>> >
>> > Any clues on how to debug this issue? What could be the reason for this?
>> >
>> > Thanks.
>> > Tony.
>> >
>>
>>
>>
>> --
>> *Lewis*
>>
>
>
