Re: com.ctc.wstx.exc.WstxLazyException exception while passing the text content of a word doc to SOLR

2009-03-17 Thread Chris Hostetter

: I am using the Apache POI parser to parse a Word doc and extract the text
: content. Then I am passing the text content to SOLR. The Word document has
: many pictures, graphs and tables. But when I pass the content to SOLR,
: it fails. Here is the exception trace.
: 
: 09:31:04,516 ERROR [STDERR] Mar 14, 2009 9:31:04 AM org.apache.solr.common.SolrException log
: SEVERE: [com.ctc.wstx.exc.WstxLazyException] com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x7) not a valid XML character
:  at [row,col {unknown-source}]: [40,18]

The error string is fairly self-explanatory: on line 40, column 18 of the
XML you sent, there is a character (0x7) that isn't legal in XML.

(Not every Unicode character is legal in XML, even when it is validly
encoded as UTF-8.)

If you search the Solr archives for "Illegal character" you'll find lots of
discussion about this and how to deal with it in general.
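
One common way to deal with it is to drop the illegal characters from the extracted text before building the update message. A minimal sketch, assuming the text is already in a Java String; the class and method names here are made up for illustration, and the ranges are the Char production from the XML 1.0 spec:

```java
public class XmlCharFilter {

    // True if the code point is allowed by the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of text with every code point that XML 1.0 forbids removed.
    // Hypothetical helper, not part of POI or Solr.
    static String stripIllegalXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Run the text returned by POI through stripIllegalXmlChars before you put it into the field of the add document. Whether you would rather replace such characters with a space than drop them depends on how the text is tokenized.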

You might also want to check out this comment pointing out some advantages 
in using Tika instead of using POI directly...

https://issues.apache.org/jira/browse/LUCENE-1559?#action_12681347

...lastly, you might want to check out this plugin and do all the hard work
server side...

http://wiki.apache.org/solr/ExtractingRequestHandler
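
As a rough sketch of what that could look like from the client side: the raw Word file is POSTed to the handler and Solr does the extraction. This assumes the handler is mapped to /update/extract in solrconfig.xml and that the unique key field is called "id"; the class name, method names, and base URL are placeholders, so adjust them to your setup.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExtractPost {

    // Builds the request URL. literal.id supplies the unique key for the
    // document; commit=true makes the change visible right away.
    // The /update/extract path is an assumption about solrconfig.xml.
    static String extractUrl(String solrBase, String docId) {
        try {
            return solrBase + "/update/extract?literal.id="
                + URLEncoder.encode(docId, "UTF-8") + "&commit=true";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    // Streams the file to Solr as the request body and returns the HTTP status.
    static int postWordDoc(String solrBase, String docId, Path file) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) new URL(extractUrl(solrBase, docId)).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/msword");
        try (OutputStream out = conn.getOutputStream()) {
            Files.copy(file, out);
        }
        return conn.getResponseCode();
    }
}
```

With this approach the client never builds XML from the extracted text itself, so it does not have to worry about escaping or filtering characters.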




-Hoss



com.ctc.wstx.exc.WstxLazyException exception while passing the text content of a word doc to SOLR

2009-03-13 Thread Suryasnat Das
Hi,

I am using the Apache POI parser to parse a Word doc and extract the text
content. Then I am passing the text content to SOLR. The Word document has
many pictures, graphs and tables. But when I pass the content to SOLR,
it fails. Here is the exception trace.

09:31:04,516 ERROR [STDERR] Mar 14, 2009 9:31:04 AM org.apache.solr.common.SolrException log
SEVERE: [com.ctc.wstx.exc.WstxLazyException] com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x7) not a valid XML character
 at [row,col {unknown-source}]: [40,18]
    at com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
    at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:729)
    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3659)
    at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
    at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:327)
    at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
    at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:235)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:190)
    at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:92)
    at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.process(SecurityContextEstablishmentValve.java:126)
    at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.invoke(SecurityContextEstablishmentValve.java:70)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:158)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:330)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:828)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:601)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
    at java.lang.Thread.run(Thread.java:595)

POI is also throwing a second error:

09:31:04,828 ERROR [STDERR] java.io.IOException: Unable to read entire header; 130 bytes read; expected 512 bytes
09:31:04,828 ERROR [STDERR] at org.apache.poi.poifs.storage.HeaderBlockReader.alertShortRead(HeaderBlockReader.java:130)
09:31:04,843 ERROR [STDERR] at org.apache.poi.poifs.storage.HeaderBlockReader.init(HeaderBlockReader.java:94)
09:31:04,843 ERROR [STDERR] at org.apache.poi.poifs.filesystem.POIFSFileSystem.init(POIFSFileSystem.java:151)
09:31:04,843 ERROR [STDERR] at org.apache.poi.hwpf.HWPFDocument.verifyAndBuildPOIFS(HWPFDocument.java:133)
09:31:04,843 ERROR [STDERR] at org.apache.poi.hwpf.extractor.WordExtractor.init(WordExtractor.java:51)
09:31:04,859 ERROR [STDERR] at com.apple.servlet.SearchApplicationServlet.parseWordFile(SearchApplicationServlet.java:963)
09:31:04,859 ERROR [STDERR] at com.apple.servlet.SearchApplicationServlet.indexDirectory(SearchApplicationServlet.java:813)
09:31:04,859 ERROR [STDERR] at com.apple.servlet.SearchApplicationServlet.index(SearchApplicationServlet.java:747)
09:31:04,859 ERROR [STDERR] at com.apple.servlet.SearchApplicationServlet.processAdd(SearchApplicationServlet.java:331)
09:31:04,874 ERROR [STDERR] at com.apple.servlet.SearchApplicationServlet.doGet(SearchApplicationServlet.java:160)
09:31:04,874 ERROR [STDERR] at