Re: [Nutch-general] re-parse hang?

Brian Whitman Thu, 04 Jan 2007 10:56:47 -0800

I ran kill -SIGQUIT on the hung parse process to get a thread dump.
It looks like jid3lib has a problem... but if jid3lib throws an
exception, shouldn't the parse-mp3 plugin and Nutch catch it and
continue? (Excuse my lack of Java knowledge...)



Full thread dump Java HotSpot(TM) Server VM (1.5.0_10-b03 mixed mode):

"Thread-0" prio=1 tid=0xa269d5e8 nid=0x797f runnable [0xa2b7e000..0xa2b7f040]
         at java.lang.Throwable.fillInStackTrace(Native Method)
         at java.lang.Throwable.<init>(Throwable.java:196)
         at java.lang.Exception.<init>(Exception.java:41)
         at org.farng.mp3.TagException.<init>(Unknown Source)
         at org.farng.mp3.InvalidTagException.<init>(Unknown Source)
         at org.farng.mp3.id3.ID3v2_3Frame.read(Unknown Source)
         at org.farng.mp3.id3.ID3v2_3Frame.<init>(Unknown Source)
         at org.farng.mp3.id3.ID3v2_3.read(Unknown Source)
         at org.farng.mp3.id3.ID3v2_3.<init>(Unknown Source)
         at org.farng.mp3.MP3File.<init>(Unknown Source)
         at org.farng.mp3.MP3File.<init>(Unknown Source)
         at org.apache.nutch.parse.mp3.MP3Parser.getParse(MP3Parser.java:69)
         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:84)
         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:76)
         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:213)
         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:105)

"Low Memory Detector" daemon prio=1 tid=0x08123d40 nid=0x797d runnable [0x00000000..0x00000000]

"CompilerThread1" daemon prio=1 tid=0x081228b0 nid=0x797c waiting on condition [0x00000000..0x9f4c51e8]

"CompilerThread0" daemon prio=1 tid=0x08121848 nid=0x797b waiting on condition [0x00000000..0x9f445068]

"AdapterThread" daemon prio=1 tid=0x081207b0 nid=0x797a waiting on condition [0x00000000..0x00000000]

"Signal Dispatcher" daemon prio=1 tid=0x0811f8b8 nid=0x7979 waiting on condition [0x00000000..0x00000000]

"Finalizer" daemon prio=1 tid=0x08116138 nid=0x7978 in Object.wait() [0x9f27f000..0x9f27f1c0]
         at java.lang.Object.wait(Native Method)
         - waiting on <0x601c8db8> (a java.lang.ref.ReferenceQueue$Lock)
         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
         - locked <0x601c8db8> (a java.lang.ref.ReferenceQueue$Lock)
         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
         at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=1 tid=0x08114ae8 nid=0x7977 in Object.wait() [0x9f1fe000..0x9f1ff040]
         at java.lang.Object.wait(Native Method)
         - waiting on <0x601b7da0> (a java.lang.ref.Reference$Lock)
         at java.lang.Object.wait(Object.java:474)
         at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
         - locked <0x601b7da0> (a java.lang.ref.Reference$Lock)

"main" prio=1 tid=0x0805d538 nid=0x796e waiting on condition [0xffff9000..0xffff9e48]
         at java.lang.Thread.sleep(Native Method)
         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:367)
         at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:130)
         at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:148)

"VM Thread" prio=1 tid=0x081125f8 nid=0x7976 runnable

"GC task thread#0 (ParallelGC)" prio=1 tid=0x08077b48 nid=0x7972 runnable

"GC task thread#1 (ParallelGC)" prio=1 tid=0x08078750 nid=0x7973 runnable

"GC task thread#2 (ParallelGC)" prio=1 tid=0x08079340 nid=0x7974 runnable

"GC task thread#3 (ParallelGC)" prio=1 tid=0x08079f30 nid=0x7975 runnable

"VM Periodic Task Thread" prio=1 tid=0x081252c0 nid=0x797e waiting on condition
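
One possible reading of the "Thread-0" trace above: the thread is
runnable inside org.farng.mp3.id3.ID3v2_3.read, constructing an
InvalidTagException, and top shows java at 99% CPU. That would be
consistent with a loop inside the library that keeps creating and
swallowing exceptions, so nothing ever propagates up to MP3Parser for
Nutch to catch. A try/catch in the plugin cannot help in that case; the
caller would need a time limit. The sketch below is only an
illustration of that idea using java.util.concurrent (available since
Java 5). It is not Nutch's or jid3lib's actual code: the class name
TimedParse, the method runWithTimeout, and the stand-in tasks are all
hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Nutch: bounds how long one parse may run.
public class TimedParse {

    /**
     * Runs the task on a daemon worker thread. Returns the task's result,
     * or the fallback if the task throws or exceeds timeoutMillis.
     */
    public static <T> T runWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timed-parse");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts; a tight loop that never checks
            // the interrupt flag keeps burning CPU on the daemon thread.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The task threw: this is the case a plain try/catch also covers.
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a parse that finishes normally.
        System.out.println(runWithTimeout(() -> "parsed", 1000L, "skipped"));

        // Stand-in for a parse that spins (exits here only when interrupted).
        Callable<String> spinning = () -> {
            while (!Thread.currentThread().isInterrupted()) {
                // busy loop, like a reader stuck on a bad tag
            }
            return "parsed";
        };
        System.out.println(runWithTimeout(spinning, 200L, "skipped"));
    }
}
```

The caller gets control back after the timeout and can skip the
document, but Java has no safe way to force-stop a thread that ignores
interruption, so a truly stuck library call would keep a CPU busy until
the process exits.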



On Jan 4, 2007, at 11:12 AM, Brian Whitman wrote:

>
> On Jan 4, 2007, at 10:47 AM, Dennis Kubes wrote:
>
>> What nutch version are you using and what is your setup?  An 80K  
>> reparse should only take a few minutes at most.
>
>
> Hi, not sure if my followup mail got through, but I found out that  
> my re-parse hang was coming from the parse-mp3 plugin -- it was  
> hanging on a particular mp3 file. I'm looking into it...
>
> That said, my 80K reparse (after taking out parse-mp3) took about  
> 30 minutes on a dual Xeon 3.0 Debian machine with 4GB RAM, running  
> the nutch nightly from two days ago. Does this seem slower than  
> normal?
>
>
>
>
>
>> Brian Whitman wrote:
>>> On yesterday's nutch-nightly, following Dennis Kubes' suggestions  
>>> on how to normalize URLs, I removed the parsed folders via
>>> rm -rf crawl_parse parse_data parse_text
>>> from a recent crawl so I could re-parse the crawl using a regex  
>>> urlnormalizer.
>>> I ran bin/nutch parse crawl/segments/2007.... on an 80K-document  
>>> segment.
>>> The hadoop log (set to INFO) showed a lot of warnings on  
>>> unparsable documents, with a mapred.JobClient -  map XX% reduce  
>>> 0% ticker steadily going up. It then stopped at map 49% with no  
>>> more warnings or info, and has been that way for about 6 hours.  
>>> Top shows java at 99% CPU.
>>> Is it hung, or should re-parsing an already crawled segment take  
>>> this long? Shouldn't hadoop be showing the parse progress?
>>> To test, I killed the process and set my nutch-site back to the  
>>> original -- no url normalizer. No change -- still hangs in the  
>>> same spot. Any ideas?
>>> -Brian
>
> --
> http://variogr.am/
> [EMAIL PROTECTED]
>
>
>

--
http://variogr.am/
[EMAIL PROTECTED]



