djames wrote:
> Hi all,
>
> I got a problem with the parser when I try to crawl 2000 sites with a depth of 3.
> I use nutch 0.81 version and my setup worked well with other sites, but this
> list gave me this error:
>
> 2007-06-06 13:49:27,997 WARN mapred.LocalJobRunner - job_qsjobz
> java.lang.StackOverflowError
> at org.apache.xerces.dom.ParentNode.getLength(Unknown Source)
> at org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(DOMContentUtils.java:305)

I've seen this on some occasions, but I haven't discovered the real reason for this error yet. For now, I suggest that you modify the source of DOMContentUtils to artificially limit the level of recursion in getOutlinks to something like 200-300.

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
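
To illustrate the recursion limit suggested above, here is a minimal sketch of a depth-capped DOM walk. It is not the actual Nutch code: the class DepthLimitedLinks, the methods collectLinks and parse, and the MAX_DEPTH value of 250 are all hypothetical names and values chosen for illustration, and the real getOutlinks has a different signature. It uses only the standard org.w3c.dom API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: collects href values from <a>
// elements while refusing to recurse below a fixed depth.
final class DepthLimitedLinks {

    // Assumed cap, picked from the 200-300 range suggested in this thread.
    public static final int MAX_DEPTH = 250;

    // Walks the tree rooted at node. The depth counter is passed down on
    // every call, so a pathologically nested document stops the recursion
    // early instead of exhausting the thread stack.
    public static void collectLinks(Node node, List<String> out, int depth, int maxDepth) {
        if (depth > maxDepth) {
            return; // too deep: give up on this branch
        }
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName())) {
            // getAttribute returns an empty string when the attribute is absent.
            String href = ((Element) node).getAttribute("href");
            if (!href.isEmpty()) {
                out.add(href);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectLinks(children.item(i), out, depth + 1, maxDepth);
        }
    }

    // Convenience parser for trying the sketch on a well-formed snippet.
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Calling collectLinks(doc.getDocumentElement(), links, 0, DepthLimitedLinks.MAX_DEPTH) starts the walk at depth 0. The trade-off is that any link nested deeper than the cap is silently skipped, so this is a workaround for the stack overflow rather than a fix for its cause.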
