I went back to my logs and found the cause of the error:

-----
fetch of http://shoppingcenter/home.asp failed with: java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
        at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
        at java.lang.StringBuilder.<init>(StringBuilder.java:68)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:557)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
java.lang.NullPointerException
        at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
        at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
        at org.apache.hadoop.io.SequenceFile$Reader.getPosition(SequenceFile.java:1736)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.getProgress(SequenceFileRecordReader.java:108)
        at org.apache.hadoop.mapred.MapTask$1.getProgress(MapTask.java:165)
        at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:155)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
fetcher caught:java.lang.NullPointerException
fetcher caught:java.lang.NullPointerException
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20070529170525]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
 - skipping invalid segment crawl/segments/20070529170525
CrawlDb update: Merging segment data into db.
CrawlDb update: done
-----
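My guess from the trace above is that the fetcher JVM simply ran out of heap while buffering page content across 50 threads. As an experiment I am going to raise the heap and lower the thread count before the next run. This is only a sketch of what I plan to try: the 2000 MB figure is an arbitrary guess, and I am assuming bin/nutch still reads the NUTCH_HEAPSIZE environment variable and turns it into -Xmx for the client JVM (in a distributed setup the equivalent knob would be mapred.child.java.opts in hadoop-site.xml):

# Give the Nutch client JVM more heap; NUTCH_HEAPSIZE is in MB
# and 2000 is just a guess, not a recommendation.
export NUTCH_HEAPSIZE=2000

# Fewer fetcher threads should also reduce the peak memory footprint
# ($segment is the variable from my crawl loop quoted below).
bin/nutch fetch $segment -threads 20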
The NullPointerException appeared many times in the stdout log; I have removed the repeats above for clarity. Because the fetch failed, this segment contains only crawl_generate and nothing else. Can anyone explain what caused this error? How can I prevent it from happening?

On 5/29/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Manoharam Reddy wrote:
> > My segment merger is not functioning properly. I am unable to figure
> > out the problem.
> >
> > These are the commands I am using:
> >
> > bin/nutch inject crawl/crawldb seedurls
> >
> > In a loop iterating 10 times:
> >
> > bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -adddays 5
> > segment=`ls -d crawl/segments/* | tail -1`
> > bin/nutch fetch $segment -threads 50
> > bin/nutch updatedb crawl/crawldb $segment
> >
> > After the loop:
> >
> > bin/nutch mergesegs crawl/merged_segments crawl/segments/*
> > rm -rf crawl/segments/*
> > mv --verbose crawl/merged_segments/* crawl/segments
> > rm -rf crawl/merged_segments
> >
> > Merging 10 segments to crawl/MERGEDsegments/20050529095045
> > SegmentMerger: adding crawl/segments/20050528144604
> > SegmentMerger: adding crawl/segments/20050528144619
> > SegmentMerger: adding crawl/segments/20050528145426
> > SegmentMerger: adding crawl/segments/20050528151323
> > SegmentMerger: adding crawl/segments/20050528164032
> > SegmentMerger: adding crawl/segments/20050528170544
> > SegmentMerger: adding crawl/segments/20050528192341
> > SegmentMerger: adding crawl/segments/20050528203512
> > SegmentMerger: adding crawl/segments/20050528210029
> > SegmentMerger: adding crawl/segments/20050529055733
> > SegmentMerger: using segment data from: crawl_generate
> > `crawl/MERGEDsegments/20050529095045' -> `crawl/segments/20050529095045'
> >
> > As can be seen here, only crawl_generate was used in the merge. Other
> > directories such as parse_data and crawl_fetch were not used. Why?
>
> This behavior is described in the javadoc of SegmentMerger. In the general
> case, users may wish to merge segments at different stages of processing:
> only generated, fetched but not parsed, or parsed. It is easy to do this
> if the segments are homogeneous, i.e. they all contain the same parts.
>
> However, if the segments are heterogeneous, i.e. some of them have been
> processed further than others, we cannot merge all their parts, because we
> would get an incomplete segment as a result (e.g. for some URLs we would
> have parse_data, for other URLs it would be missing).
>
> In such cases SegmentMerger processes only the lowest common denominator,
> i.e. only those segment parts that are present in all input segments, and
> disregards any other existing parts.
>
> That is the long answer to your problem, which is that one or more of your
> input segments has not been fetched yet. Don't include that segment in the
> list of input segments, and all should be fine.
>
> --
> Best regards,
> Andrzej Bialecki   <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  || |   Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
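Thanks, that explains the mergesegs behaviour. Following that advice, this is the guard I plan to put in front of mergesegs so that only fully fetched and parsed segments get merged. It is a sketch only: the merge_list variable is mine, and the -d tests assume the segments sit on the local filesystem; on HDFS an equivalent dfs-level check would be needed instead.

merge_list=""
for seg in crawl/segments/*; do
  # Skip any segment that has not been fetched and parsed yet,
  # so SegmentMerger sees only homogeneous, complete segments.
  if [ -d "$seg/crawl_fetch" ] && [ -d "$seg/parse_data" ]; then
    merge_list="$merge_list $seg"
  else
    echo "skipping incomplete segment: $seg"
  fi
done
bin/nutch mergesegs crawl/merged_segments $merge_list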
