On Jan 4, 2007, at 10:47 AM, Dennis Kubes wrote:

> What Nutch version are you using, and what is your setup? An 80K
> reparse should only take a few minutes at most.


Hi, I'm not sure whether my follow-up mail got through, but I found
that my re-parse hang was coming from the parse-mp3 plugin -- it was
hanging on one particular MP3 file. I'm looking into it...

That said, my 80K reparse (after taking out parse-mp3) took about 30
minutes on a dual Xeon 3.0 Debian machine with 4GB RAM, running the
Nutch nightly from two days ago. Does this seem slower than normal?
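
For a hang like the parse-mp3 one above, one possible guard is to run
each document's parse on a worker thread and abandon it after a timeout,
so that a single bad file cannot stall the whole segment. The sketch
below is plain Java and is not Nutch's own code: ParseWithTimeout,
runWithTimeout, and the timeout values are hypothetical names chosen for
illustration, and the lambdas stand in for a real parser call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ParseWithTimeout {

    /**
     * Runs a parse task on a worker thread and gives up after timeoutMillis.
     * Returns the task's result, or null if it timed out or threw.
     * (Hypothetical helper; not part of Nutch.)
     */
    static <T> T runWithTimeout(Callable<T> parseTask, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true); // a stuck parser must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(parseTask);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker thread
            return null;
        } catch (ExecutionException e) {
            return null; // the parse task itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a document that parses quickly.
        String fast = runWithTimeout(() -> "parsed", 1000);
        System.out.println("fast: " + fast);

        // Stand-in for a document whose parser never returns in time.
        String stuck = runWithTimeout(() -> {
            Thread.sleep(60_000);
            return "never";
        }, 200);
        System.out.println("stuck: " + stuck);
    }
}
```

Cancelling only interrupts the worker. A parser spinning in a CPU-bound
loop may ignore the interrupt and keep burning CPU, which is why the
worker here is a daemon thread: the caller can move on to the next
document and the JVM can still exit.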

> Brian Whitman wrote:
>> On yesterday's nutch-nightly, following Dennis Kubes' suggestions on
>> how to normalize URLs, I removed the parsed folders via
>> rm -rf crawl_parse parse_data parse_text
>> from a recent crawl so I could re-parse the crawl using a regex  
>> urlnormalizer.
>> I ran bin/nutch parse crawl/segments/2007.... on an 80K document
>> segment.
>> The hadoop log (set to INFO) showed a lot of warnings on  
>> unparsable documents, with a mapred.JobClient -  map XX% reduce 0%  
>> ticker steadily going up. It then stopped at map 49% with no more
>> warnings or info, and has been that way for about 6 hours. Top  
>> shows java at 99% CPU.
>> Is it hung or should re-parsing an already crawled segment take  
>> this long? Shouldn't hadoop be showing the parse progress?
>> To test I killed the process and set my nutch-site back to the  
>> original -- no URL normalizer. No change -- still hangs in the same
>> spot. Any ideas?
>> -Brian

--
http://variogr.am/
[EMAIL PROTECTED]




_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
