Brian Whitman wrote:
> I wanted to try last night's nightly for the new freegen command.
> On my test case, which is:
>
> rm -rf crawl
> bin/nutch inject crawl/crawldb urls/ # a single URL is in urls/urls
> bin/nutch generate crawl/crawldb crawl/segments
> bin/nutch fetch crawl/segments/2007...
> bin/nutch updatedb crawl/crawldb crawl/segments/2007...
>
> # generate a new segment with 5 URIs
> bin/nutch generate crawl/crawldb crawl/segments -topN 10
> bin/nutch fetch crawl/segments/2007... # new segment
> bin/nutch updatedb crawl/crawldb crawl/segments/2007... # new segment
>
> # merge the segments and index
> bin/nutch mergesegs crawl/merged -dir crawl/segments
> ..
>
> We get a crash in the mergesegs. This crash, with the exact same
> script and start URI, configuration and plugins, does not happen on a
> nightly from a week ago.
>
> 2007-01-18 14:57:11,411 INFO segment.SegmentMerger - Merging 2 segments to crawl/merged_07_01_18_14_56_22/20070118145711
> 2007-01-18 14:57:11,482 INFO segment.SegmentMerger - SegmentMerger: adding crawl/segments/20070118145628
> 2007-01-18 14:57:11,489 INFO segment.SegmentMerger - SegmentMerger: adding crawl/segments/20070118145641
> 2007-01-18 14:57:11,495 INFO segment.SegmentMerger - SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
> 2007-01-18 14:57:11,594 INFO mapred.InputFormatBase - Total input paths to process : 12
> 2007-01-18 14:57:11,819 INFO mapred.JobClient - Running job: job_5ug2ip
> 2007-01-18 14:57:12,073 WARN mapred.LocalJobRunner - job_5ug2ip
> java.io.EOFException
>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>         at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:57)
>         at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:91)
>         at org.apache.hadoop.io.UTF8.readChars(UTF8.java:212)
>         at org.apache.hadoop.io.UTF8.readString(UTF8.java:204)
>         at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:173)
UTF8? How weird - recent versions of Nutch tools, such as Crawl, Generate et al. (and SegmentMerger) do NOT use UTF8, they use Text. It seems this data was created with older versions. Please check that you don't have older versions of Hadoop or Nutch classes on your classpath.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
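To see why mixing the two Writable formats ends in the EOFException from the trace, here is a minimal sketch in plain Java (no Hadoop dependency). The class name and the exact prefix layouts are illustrative assumptions: the old `hadoop.io.UTF8` writes a fixed 2-byte length before the bytes, while a reader expecting a different length encoding (here simplified to a 4-byte int; `Text` actually uses a variable-length int) misreads the length field and then `readFully` runs off the end of the record:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

// Sketch of a Writable length-prefix mismatch. Both layouts below are
// simplified stand-ins, not the real Hadoop wire formats.
public class LengthPrefixMismatch {

    // Writer: 2-byte length prefix, then the raw bytes -- roughly the
    // layout the old hadoop.io.UTF8 class used.
    static byte[] writeOldStyle(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        byte[] payload = s.getBytes("UTF-8");
        out.writeShort(payload.length);
        out.write(payload);
        return buf.toByteArray();
    }

    // Reader: expects a 4-byte int length (a stand-in for any newer,
    // incompatible layout). It misreads the length field, allocates a
    // huge buffer, and readFully runs out of data -- the same
    // EOFException-from-readFully shape as the stack trace above.
    static boolean readNewStyle(byte[] data) throws IOException {
        DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(data));
        int bogusLength = in.readInt();      // length field misread
        byte[] dest = new byte[bogusLength];
        try {
            in.readFully(dest);              // not enough bytes left
            return false;                    // read "succeeded"
        } catch (EOFException e) {
            return true;                     // mismatch surfaces as EOF
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] record = writeOldStyle("http://example.com/");
        System.out.println("EOF on mismatched read: " + readNewStyle(record));
        // prints: EOF on mismatched read: true
    }
}
```

The point of the sketch: the reader has no way to detect that the data was written in the other format, so the failure shows up later as a seemingly unrelated EOFException deep in the read path, which is exactly why stale UTF8-era classes or segments on the classpath produce the crash above.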
