Hello, while crawling a large batch of documents I ran into a problem
with OOParser. The parse failure itself wouldn't be a big deal, but after it
Fetcher2 stopped fetching completely, so it looks like I'll have to kill it,
which wastes 800,000 fetched documents. I guess I'll have to fetch in smaller
batches. If you have any idea how to resume a hung fetcher, please let me know.

The exception text:

2007-06-28 12:45:32,775 WARN  oo.OOParser - org.jdom.JDOMException: Error in building: /nutch/search/office.dtd (No such file or directory)
2007-06-28 12:45:32,775 WARN  oo.OOParser - at org.jdom.input.SAXBuilder.build(SAXBuilder.java:373)
2007-06-28 12:45:32,775 WARN  oo.OOParser - at org.jdom.input.SAXBuilder.build(SAXBuilder.java:673)
2007-06-28 12:45:32,775 WARN  oo.OOParser - at org.apache.nutch.parse.oo.OOParser.parseContent(OOParser.java:113)
2007-06-28 12:45:32,775 WARN  oo.OOParser - at org.apache.nutch.parse.oo.OOParser.getParse(OOParser.java:82)
2007-06-28 12:45:32,775 WARN  oo.OOParser - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:84)
2007-06-28 12:45:32,775 WARN  oo.OOParser - at org.apache.nutch.fetcher.Fetcher2$FetcherThread.output(Fetcher2.java:669)
2007-06-28 12:45:32,775 WARN  oo.OOParser - at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:511)
2007-06-28 12:45:32,775 WARN  oo.OOParser - Caused by: java.io.FileNotFoundException: /nutch/search/office.dtd (No such file or directory)
2007-06-28 12:45:32,775 WARN  oo.OOParser - at java.io.FileInputStream.open(Native Method)
2007-06-28 12:45:32,775 WARN  oo.OOParser - at java.io.FileInputStream.<init>(FileInputStream.java:106)
2007-06-28 12:45:32,775 WARN  oo.OOParser - at java.io.FileInputStream.<init>(FileInputStream.java:66)
2007-06-28 12:45:32,775 WARN  oo.OOParser - at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:70)
2007-06-28 12:45:32,776 WARN  oo.OOParser - at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:161)
2007-06-28 12:45:32,776 WARN  oo.OOParser - at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
2007-06-28 12:45:32,776 WARN  oo.OOParser - at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
2007-06-28 12:45:32,776 WARN  oo.OOParser - at org.apache.xerces.impl.XMLEntityManager.startDTDEntity(Unknown Source)
2007-06-28 12:45:32,776 WARN  oo.OOParser - at org.apache.xerces.impl.XMLDTDScannerImpl.setInputSource(Unknown Source)
2007-06-28 12:45:32,776 WARN  oo.OOParser - at org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(Unknown Source)
2007-06-28 12:45:32,776 WARN  oo.OOParser - at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
2007-06-28 12:45:32,776 WARN  oo.OOParser - at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
2007-06-28 12:45:32,776 WARN  oo.OOParser - at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
2007-06-28 12:45:32,776 WARN  oo.OOParser - at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
2007-06-28 12:45:32,776 WARN  oo.OOParser - at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
2007-06-28 12:45:32,776 WARN  oo.OOParser - at org.jdom.input.SAXBuilder.build(SAXBuilder.java:354)
2007-06-28 12:45:32,776 WARN  oo.OOParser - ... 6 more
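
The trace suggests that the XML parser tried to load the external DTD
(office.dtd) named in the document's DOCTYPE and could not find it on disk.
One common way to avoid that lookup is to install an EntityResolver that
returns an empty entity. The sketch below is only an illustration using the
JDK's built-in DOM parser, not the actual OOParser code; the class name
DtdSkipDemo, the method rootName, and the sample XML are made up for this
example. JDOM's SAXBuilder has a setEntityResolver method that could take the
same kind of resolver, but whether that fits OOParser is an assumption.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DtdSkipDemo {

    // Parses the given XML and returns the root element name.
    // External entities (including the DTD from the DOCTYPE) are
    // replaced with empty content, so no file or network access happens.
    static String rootName(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        DocumentBuilder builder = factory.newDocumentBuilder();

        // Hypothetical resolver: answer every external reference
        // with an empty stream instead of fetching it.
        builder.setEntityResolver(
            (publicId, systemId) -> new InputSource(new StringReader("")));

        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getNodeName();
    }

    public static void main(String[] args) throws Exception {
        // office.dtd does not exist here, yet parsing is expected to succeed.
        String xml = "<!DOCTYPE doc SYSTEM \"office.dtd\"><doc>hello</doc>";
        System.out.println(rootName(xml));
    }
}
```

This only addresses the missing-DTD warning. It says nothing about why
Fetcher2 stopped fetching afterwards, which may be a separate problem.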
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
