Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.
Shalin,

I downloaded the nightly for 21-Jan and tried DIH again. It's better but still broken. Dozens of embedded tags are stripped from documents, but it now fails every few documents for no reason I can see. Manually removing the embedded tags lets a given problem document be indexed, only to have it fail on one of the next few documents. I think the problem is still in stripHTML.

Here is the traceback.

Jan 21, 2009 12:06:53 PM org.apache.catalina.startup.Catalina start
INFO: Server startup in 3377 ms
Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrCore execute
INFO: [fdocs] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=13
Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
Jan 21, 2009 12:07:39 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [fdocs] REMOVING ALL DOCUMENTS FROM INDEX
Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=2
    commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_1,version=1232539612130,generation=1,filenames=[segments_1]
    commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_2,version=1232539612131,generation=2,filenames=[segments_2]
Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: last commit = 1232539612131
Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
SEVERE: Exception while processing: jc document : null
org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0 Processing Document # 9
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
    at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
    ... 9 more
Caused by: java.util.NoSuchElementException
    at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083)
    at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
    at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
    at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
    at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
    ... 10 more
Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0 Processing Document # 9
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
    at
Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.
Hi Fergus,

It seems a field it is expecting is missing from the XML.

<field column="fileAbsPath" template="${jcurrent.fileAbsolutePath}" />
<field column="fileWebPath" regex="/Volumes/spare/ts/(.*)" replaceWith="$1" sourceColName="*fileAbsePath*"/>

I guess fileAbsePath is a typo? Can you check if that is the cause?

On Wed, Jan 21, 2009 at 5:40 PM, Fergus McMenemie fer...@twig.me.uk wrote:
> Shalin,
>
> I downloaded the nightly for 21-Jan and tried DIH again. It's better but
> still broken. Dozens of embedded tags are stripped from documents, but it
> now fails every few documents for no reason I can see. Manually removing
> the embedded tags lets a given problem document be indexed, only to have
> it fail on one of the next few documents. I think the problem is still in
> stripHTML.
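To make the suspected typo concrete, here is a minimal sketch of what a regex/replaceWith rule does when its source column is spelled correctly and when it is not. It is not DIH's RegexTransformer code: the class `SourceColumnSketch` and the helper `regexReplace` are hypothetical names, and treating a missing source column as "no value" is an assumption made for illustration. Only the column names, the regex, and the file path come from the thread.

```java
import java.util.HashMap;
import java.util.Map;

class SourceColumnSketch {
    // Hypothetical helper, not DIH code: apply a regex replacement to the
    // value stored under sourceColName. Returns null when the row has no
    // such column (assumed behaviour, for illustration only).
    public static String regexReplace(Map<String, Object> row, String sourceColName,
                                      String regex, String replaceWith) {
        Object value = row.get(sourceColName);
        if (value == null) {
            return null; // misspelled or missing source column
        }
        return value.toString().replaceAll(regex, replaceWith);
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("fileAbsPath", "/Volumes/spare/ts/ftic/groups/j0036.xml");

        String regex = "/Volumes/spare/ts/(.*)";
        // Correct column name: the prefix is removed.
        System.out.println(regexReplace(row, "fileAbsPath", regex, "$1"));
        // Misspelled column name: there is nothing to transform.
        System.out.println(regexReplace(row, "fileAbsePath", regex, "$1"));
    }
}
```

With the correct name the helper returns the path relative to /Volumes/spare/ts/; with the misspelled name it finds no value at all, which is the kind of silent gap the question above is probing for.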
Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.
> Hi Fergus,
>
> It seems a field it is expecting is missing from the XML.

You mean there is some field in the document we are indexing that is missing?

> <field column="fileAbsPath" template="${jcurrent.fileAbsolutePath}" />
> <field column="fileWebPath" regex="/Volumes/spare/ts/(.*)" replaceWith="$1" sourceColName="*fileAbsePath*"/>
>
> I guess fileAbsePath is a typo? Can you check if that is the cause?

Well spotted. I had made a mess of sanitizing the config file I sent to you. In future I will make sure the stuff I am messing with matches what I send to the list. However, there is no typo in the underlying file; at least not on that line :-)

> On Wed, Jan 21, 2009 at 5:40 PM, Fergus McMenemie fer...@twig.me.uk wrote:
>> Shalin,
>>
>> I downloaded the nightly for 21-Jan and tried DIH again. It's better but
>> still broken. Dozens of embedded tags are stripped from documents, but it
>> now fails every few documents for no reason I can see. Manually removing
>> the embedded tags lets a given problem document be indexed, only to have
>> it fail on one of the next few documents. I think the problem is still in
>> stripHTML.
Cant get HTMLStripTransformer's stripHTML to work in DIH.
Hello all,

I have the following DIH data-config.xml file. Adding HTMLStripTransformer and the associated stripHTML on the para tag seems to have broken things. I am using a nightly build from 12-jan-2009.

The /record/sect1/para contains HTML sub tags which need to be discarded. Is my use of stripHTML correct?

<dataConfig>
  <dataSource name="myfilereader" type="FileDataSource"/>
  <document>
    <entity name="jcurrent"
            processor="FileListEntityProcessor"
            fileName=".*xml"
            newerThan="'NOW-1000DAYS'"
            recursive="true"
            rootEntity="false"
            dataSource="null"
            baseDir="/Volumes/spare/ts/jxml/data/news/groups">
      <entity name="x"
              dataSource="myfilereader"
              processor="XPathEntityProcessor"
              url="${jcurrent.fileAbsolutePath}"
              stream="false"
              forEach="/record"
              transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer,HTMLStripTransformer">
        <field column="fileAbsPath" template="${jcurrent.fileAbsolutePath}" />
        <field column="fileWebPath" regex="/Volumes/spare/ts/(.*)" replaceWith="$1" sourceColName="fileAbsePath"/>
        <field column="title" xpath="/record/title" />
        <field column="para" xpath="/record/sect1/para" stripHTML="true" />
        <field column="subject" xpath="/record/metadata/subje...@qualifier='fullTitle']" />
        <field column="pubname" xpath="/record/metadata/subje...@qualifier='publication']" />
        <field column="pubdate" xpath="/record/metadata/da...@qualifier='pubDate']" dateTimeFormat="MMdd" />
      </entity>
    </entity>
  </document>
</dataConfig>

--
===
Fergus McMenemie        Email:fer...@twig.me.uk
Techmore Ltd            Phone:(UK) 07721 376021
Unix/Mac/Intranets      Analyst Programmer
===
Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.
> This looks fine. Can you post the stack trace?

Yep, here is the juicy bit. Let me know if you need more.

Jan 19, 2009 11:08:03 AM org.apache.catalina.startup.Catalina start
INFO: Server startup in 2390 ms
Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrCore execute
INFO: [janesdocs] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=12
Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [janesdocs] REMOVING ALL DOCUMENTS FROM INDEX
Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=2
    commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_1,version=1232363283058,generation=1,filenames=[segments_1]
    commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_2,version=1232363283059,generation=2,filenames=[segments_2]
Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: last commit = 1232363283059
Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.EntityProcessorBase applyTransformer
WARNING: transformer threw error
java.lang.NullPointerException
    at java.io.StringReader.<init>(StringReader.java:33)
    at org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
    at org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
    at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DocBuilder buildDocument
SEVERE: Exception while processing: janescurrent document : null
org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
    at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:203)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
Caused by: java.lang.NullPointerException
    at java.io.StringReader.<init>(StringReader.java:33)
    at org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
    at org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
    at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
    ... 9 more
Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DataImporter doFullImport
SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
    at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:203)
    at
Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.
Hmmm,

Just to clarify: I retested the thing using the nightly as of today, 18-jan-2009. The problem is still there, and this traceback is from that nightly.

>> This looks fine. Can you post the stack trace?
>
> Yep, here is the juicy bit. Let me know if you need more.
Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.
Ah, it needs a null check for multi-valued fields. I've committed a fix to trunk. The next nightly build should have it. You can check out and build from trunk if you need this immediately.

On Mon, Jan 19, 2009 at 7:02 PM, Fergus McMenemie fer...@twig.me.uk wrote:
> Hmmm,
>
> Just to clarify: I retested the thing using the nightly as of today,
> 18-jan-2009. The problem is still there, and this traceback is from that
> nightly.
>
>>> This looks fine. Can you post the stack trace?
>>
>> Yep, here is the juicy bit. Let me know if you need more.
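For readers following along, the NullPointerException in the earlier trace is raised inside the java.io.StringReader constructor, which is what happens when it is handed a null string. Below is a minimal sketch of the kind of null check described above. It is not the code committed to trunk: the class `NullSafeStripSketch`, the methods `stripTags` and `stripField`, and the crude regex-based tag removal are all hypothetical stand-ins, and skipping null entries is an assumption, since the thread does not say how the actual fix treats them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NullSafeStripSketch {
    // Crude stand-in for a real HTML stripper: drops anything that looks
    // like a tag. Hypothetical, for illustration only.
    public static String stripTags(String value) {
        return value.replaceAll("<[^>]*>", "");
    }

    // Strips tags from a field value that may be null, a single value, or a
    // list of values (a multi-valued field) that itself contains nulls.
    public static Object stripField(Object fieldValue) {
        if (fieldValue == null) {
            return null; // nothing to strip
        }
        if (fieldValue instanceof List) {
            List<Object> stripped = new ArrayList<>();
            for (Object item : (List<?>) fieldValue) {
                if (item == null) {
                    continue; // the null check: skip empty entries (assumed policy)
                }
                stripped.add(stripTags(item.toString()));
            }
            return stripped;
        }
        return stripTags(fieldValue.toString());
    }

    public static void main(String[] args) {
        // A multi-valued "para" field where one entry is null.
        List<Object> paras = Arrays.asList("<b>bold</b> text", null, "plain");
        System.out.println(stripField(paras));
        System.out.println(stripField(null));
    }
}
```

Without the two null guards, the null entry would reach the stripping step and fail in the same way as the trace; with them, the remaining values are stripped and the import can carry on.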