Dear community,

I am experiencing a strange problem while trying to import an XML document into Solr via the DataImportHandler. The XML document contains some special characters (e.g. the German ü) that are represented as XML entities such as &uuml; or &auml;. There is also a DTD file that defines these entities (<!ENTITY uuml "ü" >). I tried both referencing the DTD file and including the DTD definitions in the XML document itself. After I start the full-import command, the import process throws an exception as soon as it tries to parse &uuml;: Undeclared general entity "uuml". Has anyone faced this problem before?
Best regards,
Michael

My data-config for importing is:

<dataConfig>
  <dataSource type="FileDataSource" encoding="ISO-8859-1" />
  <document>
    <!-- stream should be true since huge xml document is being parsed -->
    <entity name="article"
            processor="XPathEntityProcessor"
            stream="true"
            forEach="/dblp/article"
            url="documents/dblp.xml">
      <field column="key" xpath="/dblp/article/@key" />
      <field column="title" xpath="/dblp/article/title" />
    </entity>
  </document>
</dataConfig>

The XML file looks e.g. like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp [
<!ENTITY uuml "ü" ><!-- small u, dieresis or umlaut mark -->
]>
<dblp>
<article key="journals/fm/Riccardi09" mdate="2011-10-27">
<author>Marco Riccardi</author>
<title>Solution of Cubic and Quartic Equations.&uuml;</title>
<pages>117-122</pages>
<year>2009</year>
<volume>17</volume>
<journal>Formalized Mathematics</journal>
<number>1-4</number>
<ee>http://dx.doi.org/10.2478/v10037-009-0012-z</ee>
<url>db/journals/fm/fm17.html#Riccardi09</url>
</article>
</dblp>

The stack trace is:

05.07.2012 17:37:19 org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {deleteByQuery=*:*,add=[persons/Codd71a, persons/Hall74]} 0 1
05.07.2012 17:37:19 org.apache.solr.common.SolrException log
SCHWERWIEGEND: Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:264)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:621)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
        ... 3 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
        at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor$3.next(XPathEntityProcessor.java:504)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor$3.next(XPathEntityProcessor.java:517)
        at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:120)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:225)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:204)
        at org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330)
        at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:683)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
        ... 5 more
Caused by: java.lang.RuntimeException: com.ctc.wstx.exc.WstxParsingException: Undeclared general entity "uuml" at [row,col {unknown-source}]: [26,42]
        at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:187)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor$2.run(XPathEntityProcessor.java:427)
Caused by: com.ctc.wstx.exc.WstxParsingException: Undeclared general entity "uuml" at [row,col {unknown-source}]: [26,42]
        at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:630)
        at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:467)
        at com.ctc.wstx.sr.BasicStreamReader.handleUndeclaredEntity(BasicStreamReader.java:5431)
        at com.ctc.wstx.sr.StreamScanner.expandUnresolvedEntity(StreamScanner.java:1661)
        at com.ctc.wstx.sr.StreamScanner.expandEntity(StreamScanner.java:1555)
        at com.ctc.wstx.sr.StreamScanner.fullyResolveEntity(StreamScanner.java:1523)
        at com.ctc.wstx.sr.BasicStreamReader.skipTokenText(BasicStreamReader.java:3568)
        at com.ctc.wstx.sr.BasicStreamReader.skipToken(BasicStreamReader.java:3342)
        at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2622)
        at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.handleStartElement(XPathRecordReader.java:376)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:310)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.handleStartElement(XPathRecordReader.java:346)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:310)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.handleStartElement(XPathRecordReader.java:346)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:310)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$200(XPathRecordReader.java:202)
        at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:184)
        ... 1 more
05.07.2012 17:37:19 org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
05.07.2012 17:37:19 org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback
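
To check whether the document itself is at fault, the entity declaration can be tested outside Solr. The following is only a minimal sketch: it uses the JDK's built-in StAX parser rather than the Woodstox parser from the stack trace, and the class name EntityCheck, the method readTitle and the shortened sample document are made up for illustration. It parses a trimmed copy of the XML above with DTD support and entity replacement switched on and returns the text of the title element.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Solr or the DataImportHandler.
final class EntityCheck {

    // Trimmed copy of the sample document: one article, title only.
    // The entity is declared in the internal DTD subset, as in the post.
    static final String SAMPLE =
          "<!DOCTYPE dblp [ <!ENTITY uuml \"\u00fc\" > ]>"
        + "<dblp><article key=\"journals/fm/Riccardi09\">"
        + "<title>Solution of Cubic and Quartic Equations.&uuml;</title>"
        + "</article></dblp>";

    // Returns the text of the first <title> element, or null if there is none.
    static String readTitle(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Standard StAX properties: read the DTD and expand entity references.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.TRUE);
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.TRUE);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "title".equals(reader.getLocalName())) {
                        // Collects all text up to </title>, including expanded entities.
                        return reader.getElementText();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // An undeclared entity would surface here as a parsing error.
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readTitle(SAMPLE));
    }
}
```

If the title comes back with the umlaut expanded, the declaration in the document is usable by a DTD-aware parser, which would suggest looking at how the importing parser is configured rather than at the XML file. This is a diagnostic aid only and does not show how the DataImportHandler sets up its own parser.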