On 6/18/07, Sean Dean <[EMAIL PROTECTED]> wrote:
> Your patch seemed to do the trick with the segment reader. Here was the 
> output, I have removed most of the page content as it would otherwise turn 
> this email into a massive pile of junk in some email clients.

I hate to say this, but I have absolutely no idea why this is happening.

When Nutch fetches a page, it puts the current segment's name into the
content's metadata. During parsing, this is passed from the content to
the parse data. Finally, during indexing, the indexer picks it up from
the parse data and stores the segment name in the index. For some reason,
in your case the parse data does not have this information, even though
the content does.
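
To make that flow concrete, here is a rough sketch in plain Java. It is
not the actual Nutch code: the class and method names are made up for
illustration, and the metadata is modelled as simple maps. Only the key
nutch.segment.name is taken from the content metadata in your dump below.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the three steps; none of these names are Nutch APIs.
public class SegmentNameFlow {
    // Metadata key as it appears in the segment dump quoted in this thread.
    static final String SEGMENT_NAME_KEY = "nutch.segment.name";

    // Fetch step: the content metadata is tagged with the segment name.
    static Map<String, String> fetch(String segmentName) {
        Map<String, String> contentMeta = new HashMap<String, String>();
        contentMeta.put(SEGMENT_NAME_KEY, segmentName);
        return contentMeta;
    }

    // Parse step: the segment name is copied from content to parse metadata.
    static Map<String, String> parse(Map<String, String> contentMeta) {
        Map<String, String> parseMeta = new HashMap<String, String>();
        String segment = contentMeta.get(SEGMENT_NAME_KEY);
        if (segment != null) {
            parseMeta.put(SEGMENT_NAME_KEY, segment);
        }
        return parseMeta;
    }

    // Index step: the segment name is read from parse metadata.
    // Returns null when the parse metadata never received it.
    static String index(Map<String, String> parseMeta) {
        return parseMeta.get(SEGMENT_NAME_KEY);
    }

    public static void main(String[] args) {
        Map<String, String> contentMeta = fetch("20070607150908");
        Map<String, String> parseMeta = parse(contentMeta);
        System.out.println("indexed segment: " + index(parseMeta));

        // The situation in this thread: content has the name, parse data does not.
        Map<String, String> brokenParseMeta = new HashMap<String, String>();
        System.out.println("indexed segment: " + index(brokenParseMeta));
    }
}
```

In this model, the second call corresponds to what you are seeing: the
copy in the parse step did not happen, so the indexer has nothing to store.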

I have gone through all the code paths responsible for copying the
segment name from content to parse data (Fetcher, Fetcher2, ParseSegment),
and I just do not see where the error is.

If anyone has any idea, now would be a good time to say it :).

>
> [command output]
>
> Version: 2
> url: http://acadisc.com/
> base: http://acadisc.com/
> contentType: text/html
> metadata: Content-Length=9292 Connection=close ETag="3969a30-244c-462eae09" 
> nutch.segment.name=20070607150908 nutch.crawl.score=1.0 Date=Fri, 08 Jun 2007 
> 22:05:45 GMT Accept-Ranges=bytes Server=Apache Content-Type=text/html 
> Last-Modified=Wed, 25 Apr 2007 01:25:29 GMT
> Content:
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
> <HTML>
>
>
> [html source content - removed]
>
> </HTML>
>
> Crawl Fetch::
> Version: 5
> Status: 33 (fetch_success)
> Fetch time: Fri Jun 08 18:11:25 EDT 2007
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 30.0 seconds (3.4722223E-4 days)
> Score: 1.0
> Signature: c079280b4afb4347372982d5a034d51b
> Metadata: _ngt_:1181243348572 _pst_:success(1), lastModified=0
>
>
> ----- Original Message ----
> From: Doğacan Güney <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Monday, June 18, 2007 2:01:14 AM
> Subject: Re: Indexing problems in nutch-nightly
>
>
> On 6/18/07, Sean Dean <[EMAIL PROTECTED]> wrote:
> > There was no result due to the fact it does not complete and the process 
> > just hangs with zero processor utilization. There was nothing in the logs 
> > to show you, but I took a stack trace before killing the process completely 
> > and here it is;
> >
> > Full thread dump Java HotSpot(TM) 64-Bit Server VM (diablo-1.5.0_07-b01 
> > mixed mode):
> > "Low Memory Detector" daemon prio=5 tid=0x00000000006cfc00 nid=0x6d5800 
> > runnable [0x0000000000000000..0x0000000000000000]
> > "CompilerThread1" daemon prio=9 tid=0x00000000006c9c00 nid=0x6cf800 waiting 
> > on condition [0x0000000000000000..0x00007fffff1f4320]
> > "CompilerThread0" daemon prio=9 tid=0x00000000006c3c00 nid=0x6c9800 waiting 
> > on condition [0x0000000000000000..0x00007fffff2f5400]
> > "AdapterThread" daemon prio=9 tid=0x00000000006bac00 nid=0x6c3800 waiting 
> > on condition [0x0000000000000000..0x0000000000000000]
> > "Signal Dispatcher" daemon prio=9 tid=0x00000000006a7c00 nid=0x6ba800 
> > waiting on condition [0x0000000000000000..0x0000000000000000]
> > "Finalizer" daemon prio=8 tid=0x00000000006a7000 nid=0x6a7800 in 
> > Object.wait() [0x00007fffff5f9000..0x00007fffff5f9910]
> >         at java.lang.Object.wait(Native Method)
> >         - waiting on <0x00000008b7860ad0> (a 
> > java.lang.ref.ReferenceQueue$Lock)
> >         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
> >         - locked <0x00000008b7860ad0> (a java.lang.ref.ReferenceQueue$Lock)
> >         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
> >         at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
> > "Reference Handler" daemon prio=10 tid=0x000000000062b800 nid=0x62bc00 in 
> > Object.wait() [0x00007fffff6fa000..0x00007fffff6fac90]
> > "main" prio=5 tid=0x0000000000516800 nid=0x516000 waiting on condition 
> > [0x00007fffffffc000..0x00007fffffffd2f0]
> >         at java.lang.Thread.sleep(Native Method)
> >         at 
> > org.apache.nutch.segment.SegmentReader.get(SegmentReader.java:348)
> >         at 
> > org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:590)
> > "VM Thread" prio=9 tid=0x000000000065f200 nid=0x62b400 runnable
> > "GC task thread#0 (ParallelGC)" prio=5 tid=0x0000000000527c00 nid=0x5af400 
> > runnable
> > "GC task thread#1 (ParallelGC)" prio=5 tid=0x00000000005b5200 nid=0x5bd000 
> > runnable
> > "VM Periodic Task Thread" prio=9 tid=0x0000000000527800 nid=0x6dc800 
> > waiting on condition
> >
> >
> >
>
> Ah, non-debuggable problems.... so much fun:)
>
> Anyway, it seems you are running into the problem described here:
> http://www.nabble.com/bug-in-SegmentReader-tf3788992.html
>
> I have put up a "patchified" version here:
> http://www.ceng.metu.edu.tr/~e1345172/segment_reader_hang.patch
>
> Can you retry with this patch?
>
> Thanks!
>
> --
> Doğacan Güney


-- 
Doğacan Güney
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
