Whatever programming language you are using probably has a function that
makes text "XML-safe". For example, I'm using ColdFusion to integrate with
Solr, and all data is passed through XmlFormat like this:

#xmlformat(usergeneratedcontent)#

My guess is that PHP, ASP, etc. all have a function like this.
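
For anyone not on ColdFusion, here is a rough sketch of the same idea in
Python, using the standard library's xml.sax.saxutils. The helper name
solr_add_xml and the sample field value are made up for illustration; only
the "text" field name comes from the original message. It escapes each
value instead of wrapping it in CDATA, so every field is handled the same
way:

```python
from xml.sax.saxutils import escape, quoteattr


def solr_add_xml(fields):
    """Build a Solr <add> message from a dict of field name -> text.

    Hypothetical helper: every value is XML-escaped, so user-supplied
    HTML is sent to Solr as plain character data.
    """
    parts = ["<add><doc>"]
    for name, value in fields.items():
        # quoteattr() returns the name already wrapped in quotes;
        # escape() replaces &, < and > with &amp;, &lt; and &gt;.
        parts.append("  <field name=%s>%s</field>" % (quoteattr(name), escape(value)))
    parts.append("</doc></add>")
    return "\n".join(parts)


if __name__ == "__main__":
    # Sample user input containing markup and an ampersand.
    print(solr_add_xml({"text": "<p>Fish & chips</p>"}))
```

The XML parser on the receiving side turns the entities back into the
original characters, so the field value that gets indexed is the
unescaped HTML string.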


On 9/20/06, Nick Snels <[EMAIL PROTECTED]> wrote:

Hi,

I want users to add content to my site using tinyMCE, which generates
HTML. When I tried adding the data to Solr, Solr refused to add it (or at
least generated an error):

SEVERE: org.xmlpull.v1.XmlPullParserException: parser must be on START_TAG or TEXT to read text (position: START_TAG seen ...<field name="text"><p>... @4:39)
   at org.xmlpull.mxp1.MXParser.nextText(MXParser.java:1071)
   at org.apache.solr.core.SolrCore.readDoc(SolrCore.java:910)
   at org.apache.solr.core.SolrCore.update(SolrCore.java:685)
   at org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:52)
   at javax.servlet.http.HttpServlet.service(HttpServlet.java:709)
   at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
   at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
   at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
   at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
   at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
   at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
   at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
   at org.apache.catalina.valves.RequestFilterValve.process(RequestFilterValve.java:275)
   at org.apache.catalina.valves.RemoteAddrValve.invoke(RemoteAddrValve.java:80)
   at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
   at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
   at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
   at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:664)
   at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
   at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
   at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
   at java.lang.Thread.run(Thread.java:595)

So I searched the archives to resolve this issue, since I didn't want to
strip out the HTML entirely. The solution proved to be wrapping the HTML
text in a CDATA section (<![CDATA[ ... ]]>), like so:

<add><doc>
  <field name="text"><![CDATA[#{field.text}]]></field>
</doc></add>

This also drew my attention to another problem: characters like < > & are
all 'invalid' characters between XML tags. Does that mean I have to put
<![CDATA[ around all the fields I want to index? I don't know and can't
control what my users will input. Is this the only solution, or is there a
way for Solr to handle these 'invalid' characters in the indexed text by
itself, without generating errors?

Kind regards,

Nick

