djames wrote:
> Hi all,
> 
> I have a problem with the parser when I try to crawl 2000 sites with a depth of
> 3.
> I use Nutch version 0.81, and my setup worked well with other sites, but this
> list gave me this error:
> 
> 2007-06-06 13:49:27,997 WARN  mapred.LocalJobRunner - job_qsjobz
> java.lang.StackOverflowError
>       at org.apache.xerces.dom.ParentNode.getLength(Unknown Source)
>       at org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(DOMContentUtils.java:305)

I've seen this occasionally, but I haven't found the root cause of the 
error yet. For now, I suggest modifying the source of DOMContentUtils to 
artificially limit the recursion depth in getOutlinks to something like 
200-300 levels.
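
To illustrate the idea, here is a minimal stand-alone sketch of a depth-limited recursive DOM walk. It is not the actual Nutch code: the class name DepthLimitedOutlinks, the methods collectHrefs and deepDocument, and the MAX_DEPTH constant are all made up for this example, and it only gathers href attributes from <a> elements instead of building Nutch Outlink objects. The same kind of depth counter could presumably be threaded through getOutlinks.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical example class, not part of Nutch.
class DepthLimitedOutlinks {
    // Assumed cap, taken from the 200-300 range suggested above.
    static final int MAX_DEPTH = 300;

    // Recursively collects href values from <a> elements, but refuses to
    // descend below maxDepth so a pathologically deep DOM cannot exhaust
    // the stack.
    static void collectHrefs(Node node, int depth, int maxDepth, List<String> out) {
        if (depth > maxDepth) {
            return; // too deep: skip this subtree entirely
        }
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName())) {
            String href = ((Element) node).getAttribute("href");
            if (!href.isEmpty()) {
                out.add(href);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectHrefs(children.item(i), depth + 1, maxDepth, out);
        }
    }

    // Builds a test document: one link directly under the root, and one
    // link buried under `levels` nested <div> elements.
    static Document deepDocument(int levels) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("html");
        doc.appendChild(root);

        Element shallow = doc.createElement("a");
        shallow.setAttribute("href", "http://example.com/shallow");
        root.appendChild(shallow);

        Element current = root;
        for (int i = 0; i < levels; i++) {
            Element child = doc.createElement("div");
            current.appendChild(child);
            current = child;
        }
        Element deep = doc.createElement("a");
        deep.setAttribute("href", "http://example.com/deep");
        current.appendChild(deep);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = deepDocument(1000);
        List<String> links = new ArrayList<>();
        collectHrefs(doc.getDocumentElement(), 0, MAX_DEPTH, links);
        System.out.println(links);
    }
}
```

The trade-off is that links nested deeper than the cap are silently dropped, so this is a workaround rather than a fix for whatever produces such deep trees.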


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
