Hi,
I am using the Apache POI parser to parse a Word document and extract its
text content, which I then pass to Solr. The Word document has many
pictures, graphs and tables. When I send the extracted content to Solr,
indexing fails. Here is the exception trace:
09:31:04,516 ERROR [STDERR] Mar 14, 2009 9:31:04 AM org.apache.solr.common.SolrException log
SEVERE: [com.ctc.wstx.exc.WstxLazyException] com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x7) not a valid XML character
 at [row,col {unknown-source}]: [40,18]
at com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:729)
at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3659)
at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:327)
at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:235)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:190)
at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:92)
at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.process(SecurityContextEstablishmentValve.java:126)
at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.invoke(SecurityContextEstablishmentValve.java:70)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:158)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:330)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:828)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:601)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
at java.lang.Thread.run(Thread.java:595)
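
For reference, code 0x7 in the message above is the BEL control character,
which falls outside the character range XML 1.0 allows. It may come from the
tables in the document, since Word's binary format uses 0x07 as a cell mark,
but the trace alone does not confirm that. A minimal sketch of one way to
filter such characters out of the extracted text before it goes into the
Solr update XML follows. The names XmlTextCleaner and stripInvalidXmlChars
are hypothetical and are not part of POI or Solr.

```java
// Hypothetical helper, not part of POI or Solr.
final class XmlTextCleaner {

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Replaces every character that XML 1.0 forbids (for example 0x07)
     * with a single space, so neighbouring words do not run together.
     */
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(' ');
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

The idea would be to call stripInvalidXmlChars on the string returned by the
extractor before adding it to the document sent to Solr. Replacing with a
space rather than deleting is a design choice here; deleting would work too
if cell boundaries do not matter for search.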
A second error, this one coming from POI, also shows up:
09:31:04,828 ERROR [STDERR] java.io.IOException: Unable to read entire header; 130 bytes read; expected 512 bytes
09:31:04,828 ERROR [STDERR] at org.apache.poi.poifs.storage.HeaderBlockReader.alertShortRead(HeaderBlockReader.java:130)
09:31:04,843 ERROR [STDERR] at org.apache.poi.poifs.storage.HeaderBlockReader.init(HeaderBlockReader.java:94)
09:31:04,843 ERROR [STDERR] at org.apache.poi.poifs.filesystem.POIFSFileSystem.init(POIFSFileSystem.java:151)
09:31:04,843 ERROR [STDERR] at org.apache.poi.hwpf.HWPFDocument.verifyAndBuildPOIFS(HWPFDocument.java:133)
09:31:04,843 ERROR [STDERR] at org.apache.poi.hwpf.extractor.WordExtractor.init(WordExtractor.java:51)
09:31:04,859 ERROR [STDERR] at com.apple.servlet.SearchApplicationServlet.parseWordFile(SearchApplicationServlet.java:963)
09:31:04,859 ERROR [STDERR] at com.apple.servlet.SearchApplicationServlet.indexDirectory(SearchApplicationServlet.java:813)
09:31:04,859 ERROR [STDERR] at com.apple.servlet.SearchApplicationServlet.index(SearchApplicationServlet.java:747)
09:31:04,859 ERROR [STDERR] at com.apple.servlet.SearchApplicationServlet.processAdd(SearchApplicationServlet.java:331)
09:31:04,874 ERROR [STDERR] at com.apple.servlet.SearchApplicationServlet.doGet(SearchApplicationServlet.java:160)
09:31:04,874 ERROR [STDERR] at