I'm trying to understand what the context is here... Are you trying to crawl web pages that have bad HTML? Or... what?

-- Jack Krupansky

-----Original Message----- From: eShard
Sent: Thursday, April 04, 2013 10:23 AM
To: solr-user@lucene.apache.org
Subject: detailed Error reporting in Solr

Good morning,
I'm currently running Solr 4.0 final with Tika v1.2 and ManifoldCF v1.2 dev,
and I'm battling Tika XML parse errors again.
Solr reports this error, which is too vague:
org.apache.solr.common.SolrException:
org.apache.tika.exception.TikaException: XML parse error
I had to manually run the link against the Tika app to get a much more
detailed error:
Caused by: org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 105;
The entity "nbsp" was referenced, but not declared.
So there are old-school non-breaking space entities (&nbsp;) in the HTML that
Tika can't handle.

For example: <li> Cyber Systems and Technology&nbsp;&rsaquo;
</mission/CST/CST.html>   </li>

My question is twofold:
1) How do I get Solr to report more detailed errors, and
2) How do I get Tika to accept (or ignore) &nbsp;?
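
To make the second question concrete, here is a minimal sketch that reproduces the same kind of SAX error using only the JDK's XML parser, then shows one possible workaround: rewriting named HTML entities as numeric character references before the markup reaches an XML parser. This is not Tika's or Solr's API. The class name EntityFixSketch, its two methods, and the two-entry entity table are hypothetical and exist only for illustration, and whether such a pre-processing step can be hooked into a given crawl pipeline is an assumption.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, JDK only: not part of Tika, Solr, or ManifoldCF.
class EntityFixSketch {
    // Assumed mapping from named HTML entities to numeric references.
    // Only the two entities from the example are listed; extend as needed.
    static final Map<String, String> NUMERIC = new LinkedHashMap<>();
    static {
        NUMERIC.put("&nbsp;", "&#160;");
        NUMERIC.put("&rsaquo;", "&#8250;");
    }

    // Replace each known named entity with its numeric equivalent,
    // which an XML parser accepts without a DTD declaration.
    static String replaceHtmlEntities(String markup) {
        String out = markup;
        for (Map.Entry<String, String> e : NUMERIC.entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }

    // Return true if the markup parses as well-formed XML; on failure,
    // print the parser's detailed message and return false.
    static boolean parsesAsXml(String markup) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors and keeps stderr quiet.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            System.out.println("Parse failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String snippet = "<li>Cyber Systems and Technology&nbsp;&rsaquo;</li>";
        System.out.println(parsesAsXml(snippet));
        System.out.println(parsesAsXml(replaceHtmlEntities(snippet)));
    }
}
```

Numeric references are used rather than deleting the entities so that the non-breaking space and the arrow character survive in the extracted text.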

Thanks,




--
View this message in context: http://lucene.472066.n3.nabble.com/detailed-Error-reporting-in-Solr-tp4053821.html Sent from the Solr - User mailing list archive at Nabble.com.
