Brian Whitman wrote:
> I wanted to try last night's nightly build for the new freegen command.
> My test case is:
>
> rm -rf crawl
> bin/nutch inject crawl/crawldb urls/  # a single URL is in urls/urls
> bin/nutch generate crawl/crawldb crawl/segments
> bin/nutch fetch crawl/segments/2007...
> bin/nutch updatedb crawl/crawldb crawl/segments/2007...
>
> # generate a new segment with 5 URIs
> bin/nutch generate crawl/crawldb crawl/segments -topN 10
> bin/nutch fetch crawl/segments/2007... # new segment
> bin/nutch updatedb crawl/crawldb crawl/segments/2007... # new segment
>
> # merge the segments and index
> bin/nutch mergesegs crawl/merged -dir crawl/segments
> ..
>
> We get a crash in the mergesegs step. With the exact same script,
> start URI, configuration, and plugins, this crash does not happen on
> a nightly from a week ago.
>
>
> 2007-01-18 14:57:11,411 INFO  segment.SegmentMerger - Merging 2 segments to crawl/merged_07_01_18_14_56_22/20070118145711
> 2007-01-18 14:57:11,482 INFO  segment.SegmentMerger - SegmentMerger: adding crawl/segments/20070118145628
> 2007-01-18 14:57:11,489 INFO  segment.SegmentMerger - SegmentMerger: adding crawl/segments/20070118145641
> 2007-01-18 14:57:11,495 INFO  segment.SegmentMerger - SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
> 2007-01-18 14:57:11,594 INFO  mapred.InputFormatBase - Total input paths to process : 12
> 2007-01-18 14:57:11,819 INFO  mapred.JobClient - Running job: job_5ug2ip
> 2007-01-18 14:57:12,073 WARN  mapred.LocalJobRunner - job_5ug2ip
> java.io.EOFException
>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>         at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:57)
>         at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:91)
>         at org.apache.hadoop.io.UTF8.readChars(UTF8.java:212)
>         at org.apache.hadoop.io.UTF8.readString(UTF8.java:204)
>         at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:173)

UTF8? How weird - recent versions of the Nutch tools, such as Crawl, 
Generate et al. (and SegmentMerger), do NOT use UTF8; they use Text. It 
seems this data was created with an older version. Please check that you 
don't have older versions of the Hadoop or Nutch classes on your classpath.
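
The failure mode fits such a mismatch: UTF8 serializes a string as a
2-byte length followed by the bytes, while Text writes a variable-length
VInt followed by standard UTF-8 bytes. A reader using the wrong class
misinterprets the length field and then dies in readFully(), which is
exactly the shape of the trace above. Here is a minimal sketch of the
mismatch (a hypothetical demo class, not Nutch code; it assumes the
Hadoop io classes of that era):

  import java.io.IOException;
  import org.apache.hadoop.io.DataInputBuffer;
  import org.apache.hadoop.io.DataOutputBuffer;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.UTF8;

  public class WritableMismatchDemo {
    public static void main(String[] args) throws IOException {
      // Write a string the new way: Text = VInt length + UTF-8 bytes.
      DataOutputBuffer out = new DataOutputBuffer();
      new Text("http://example.com/").write(out);

      // Read it back the old way: UTF8 expects a 2-byte length first.
      // The VInt byte plus the first data byte are misread as a large
      // length, so readChars() asks readFully() for more bytes than
      // the buffer holds -> java.io.EOFException, as in the log.
      DataInputBuffer in = new DataInputBuffer();
      in.reset(out.getData(), out.getLength());
      System.out.println(UTF8.readString(in));
    }
  }

If that is what's happening here, data written by last week's build
being read by last night's (or a stale jar shadowing the new classes)
would produce exactly this kind of EOFException, so it's worth making
sure the jars on the classpath and the segment data come from the same
build.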

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


