java.io.EOFException in newer nightlies in mergesegs or indexing from hadoop.io.DataOutputBuffer
------------------------------------------------------------------------------------------------
Key: NUTCH-433
URL: https://issues.apache.org/jira/browse/NUTCH-433
Project: Nutch
Issue Type: Bug
Components: generator, indexer
Affects Versions: 0.9.0
Environment: Both Linux/i686 and Mac OS X PPC/Intel, but platform independent
Reporter: Brian Whitman
Priority: Critical
The nightly builds have not been working at all for the past couple of weeks. Sami Siren has narrowed the cause down to HADOOP-331.
To replicate: download the nightly build, then run:
bin/nutch inject crawl/crawldb urls/ # a single URL is in urls/urls -- http://apache.org
bin/nutch generate crawl/crawldb crawl/segments
bin/nutch fetch crawl/segments/2007...
bin/nutch updatedb crawl/crawldb crawl/segments/2007...
# generate a new segment with 5 URIs
bin/nutch generate crawl/crawldb crawl/segments -topN 5
bin/nutch fetch crawl/segments/2007... # new segment
bin/nutch updatedb crawl/crawldb crawl/segments/2007... # new segment
# merge the segments and index
bin/nutch mergesegs crawl/merged -dir crawl/segments
..
The crash happens during mergesegs. With the exact same script, start URL, configuration, and plugins, it does not happen on a nightly from early January.
2007-01-18 14:57:11,411 INFO segment.SegmentMerger - Merging 2 segments to crawl/merged_07_01_18_14_56_22/20070118145711
2007-01-18 14:57:11,482 INFO segment.SegmentMerger - SegmentMerger: adding crawl/segments/20070118145628
2007-01-18 14:57:11,489 INFO segment.SegmentMerger - SegmentMerger: adding crawl/segments/20070118145641
2007-01-18 14:57:11,495 INFO segment.SegmentMerger - SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
2007-01-18 14:57:11,594 INFO mapred.InputFormatBase - Total input paths to process : 12
2007-01-18 14:57:11,819 INFO mapred.JobClient - Running job: job_5ug2ip
2007-01-18 14:57:12,073 WARN mapred.LocalJobRunner - job_5ug2ip
java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:57)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:91)
        at org.apache.hadoop.io.UTF8.readChars(UTF8.java:212)
        at org.apache.hadoop.io.UTF8.readString(UTF8.java:204)
        at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:173)
        at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:61)
        at org.apache.nutch.metadata.MetaWrapper.readFields(MetaWrapper.java:100)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.spill(MapTask.java:427)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:385)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$200(MapTask.java:239)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:188)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:109)
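For reference, the read path at the top of the trace can be reproduced in isolation: UTF8.readString() reads a length prefix and then pulls that many bytes through DataOutputBuffer.write(DataInput, int), which calls readFully() and throws EOFException when the record is shorter than its declared length. The snippet below is only an illustration of that failure mode, not Nutch or Hadoop code from the job; it assumes a Hadoop jar of the same vintage on the classpath.

    // Illustration only: simulate a length-prefixed record that was cut short,
    // which is what the spill path appears to hand MetaWrapper.readFields().
    import java.io.ByteArrayInputStream;
    import java.io.DataInputStream;
    import org.apache.hadoop.io.DataOutputBuffer;
    import org.apache.hadoop.io.UTF8;

    public class TruncatedUTF8Read {
      public static void main(String[] args) throws Exception {
        // Serialize a UTF8 value, then drop the trailing bytes to mimic a short record.
        DataOutputBuffer out = new DataOutputBuffer();
        new UTF8("http://apache.org").write(out);

        int truncatedLength = out.getLength() - 4;
        DataInputStream in = new DataInputStream(
            new ByteArrayInputStream(out.getData(), 0, truncatedLength));

        // readString() reads the length prefix, then readFully()s that many bytes;
        // the short stream produces the same java.io.EOFException as in the trace.
        UTF8.readString(in);
      }
    }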