Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.

2009-01-21 Thread Fergus McMenemie
Shalin

I downloaded the nightly for 21 Jan and tried DIH again. It's better but
still broken. Dozens of embedded tags are stripped from documents,
but it now fails every few documents for no reason I can see. Manually
removing the embedded tags lets a given problem document be indexed,
only to have it fail on one of the next few documents. I think the
problem is still in stripHTML.

Here is the traceback.

Jan 21, 2009 12:06:53 PM org.apache.catalina.startup.Catalina start
INFO: Server startup in 3377 ms
Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrCore execute
INFO: [fdocs] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=13
Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
Jan 21, 2009 12:07:39 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [fdocs] REMOVING ALL DOCUMENTS FROM INDEX
Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=2

commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_1,version=1232539612130,generation=1,filenames=[segments_1]

commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_2,version=1232539612131,generation=2,filenames=[segments_2]
Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: last commit = 1232539612131
Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
SEVERE: Exception while processing: jc document : null
org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0 Processing Document # 9
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
    at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
    ... 9 more
Caused by: java.util.NoSuchElementException
    at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083)
    at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
    at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
    at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
    at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
    ... 10 more
Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0 Processing Document # 9
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
    at
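
The innermost cause in the traceback above is a java.util.NoSuchElementException thrown from a StAX reader's next() call. The XMLStreamReader contract says next() throws that exception when hasNext() is false, so the trace is at least consistent with a parse loop that asks for another event after the document has ended. The sketch below only illustrates the guarded loop. It uses the JDK's javax.xml.stream API, and the class StaxGuardSketch and method readParas are hypothetical names, not Solr code and not a diagnosis of XPathRecordReader.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Solr: collects the text of every <para>
// element, including text inside nested child tags.
class StaxGuardSketch {

    static List<String> readParas(String xml) {
        List<String> paras = new ArrayList<>();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            StringBuilder current = null; // non-null while inside a <para>
            int depth = 0;                // nesting depth below the open <para>
            // Guard every next() with hasNext(); an unguarded next() after
            // END_DOCUMENT is documented to throw NoSuchElementException.
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if (current != null) {
                        depth++; // embedded tag inside the paragraph
                    } else if ("para".equals(reader.getLocalName())) {
                        current = new StringBuilder();
                        depth = 0;
                    }
                } else if (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) {
                    if (current != null) {
                        current.append(reader.getText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT && current != null) {
                    if (depth == 0) {
                        paras.add(current.toString()); // closing </para>
                        current = null;
                    } else {
                        depth--; // closing an embedded tag
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return paras;
    }

    public static void main(String[] args) {
        String xml = "<record><sect1><para>Hello <b>bold</b> world</para></sect1></record>";
        System.out.println(readParas(xml));
    }
}
```

Tracking the nesting depth explicitly means an embedded tag never causes the loop to read past the end of its enclosing element.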

Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.

2009-01-21 Thread Shalin Shekhar Mangar
Hi Fergus,

It seems a field it is expecting is missing from the XML.

<field column="fileAbsPath" template="${jcurrent.fileAbsolutePath}" />
<field column="fileWebPath" regex="/Volumes/spare/ts/(.*)" replaceWith="$1"
sourceColName="*fileAbsePath*"/>

I guess fileAbsePath is a typo? Can you check if that is the cause?
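
A mismatch like this can be caught mechanically before running an import. The following is a minimal sketch, assuming the field definitions have already been read into attribute maps. The class SourceColCheckSketch and its methods are hypothetical and not part of DIH.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical checker, not part of DIH: flags sourceColName values that do
// not match any column declared in the same entity.
class SourceColCheckSketch {

    // Builds one field definition from alternating attribute names and values.
    static Map<String, String> field(String... keyValues) {
        Map<String, String> attrs = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            attrs.put(keyValues[i], keyValues[i + 1]);
        }
        return attrs;
    }

    // Returns every sourceColName that names no declared column.
    static List<String> danglingSources(List<Map<String, String>> fields) {
        Set<String> columns = new LinkedHashSet<>();
        for (Map<String, String> f : fields) {
            String column = f.get("column");
            if (column != null) {
                columns.add(column);
            }
        }
        List<String> dangling = new ArrayList<>();
        for (Map<String, String> f : fields) {
            String source = f.get("sourceColName");
            if (source != null && !columns.contains(source)) {
                dangling.add(source);
            }
        }
        return dangling;
    }

    public static void main(String[] args) {
        List<Map<String, String>> fields = Arrays.asList(
            field("column", "fileAbsPath", "template", "${jcurrent.fileAbsolutePath}"),
            field("column", "fileWebPath", "sourceColName", "fileAbsePath"));
        System.out.println(danglingSources(fields));
    }
}
```

The check only sees columns declared as fields. A column supplied implicitly by the entity processor would be reported as dangling, so treat the result as a hint rather than proof of an error.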



Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.

2009-01-21 Thread Fergus McMenemie
>Hi Fergus,
>
>It seems a field it is expecting is missing from the XML.

You mean there is some field in the document we are indexing
that is missing?

><field column="fileAbsPath" template="${jcurrent.fileAbsolutePath}" />
><field column="fileWebPath" regex="/Volumes/spare/ts/(.*)" replaceWith="$1"
>sourceColName="*fileAbsePath*"/>
>
>I guess fileAbsePath is a typo? Can you check if that is the cause?

Well spotted. I had made a mess of sanitizing the config file I sent
to you. In future I will make sure the stuff I am messing with matches
what I send to the list. However, there is no typo in the underlying file;
at least not on that line :-)





Cant get HTMLStripTransformer's stripHTML to work in DIH.

2009-01-19 Thread Fergus McMenemie
Hello all,

I have the following DIH data-config.xml file. Adding
HTMLStripTransformer and the associated stripHTML on the
para tag seems to have broken things. I am using a nightly
build from 12-jan-2009.

The /record/sect1/para contains HTML sub tags which need
to be discarded. Is my use of stripHTML correct?

<dataConfig>
  <dataSource name="myfilereader" type="FileDataSource"/>
  <document>
    <entity name="jcurrent"
            processor="FileListEntityProcessor"
            fileName=".*xml"
            newerThan="'NOW-1000DAYS'"
            recursive="true"
            rootEntity="false"
            dataSource="null"
            baseDir="/Volumes/spare/ts/jxml/data/news/groups">

      <entity name="x"
              dataSource="myfilereader"
              processor="XPathEntityProcessor"
              url="${jcurrent.fileAbsolutePath}"
              stream="false"
              forEach="/record"
              transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer,HTMLStripTransformer">

        <field column="fileAbsPath" template="${jcurrent.fileAbsolutePath}" />
        <field column="fileWebPath" regex="/Volumes/spare/ts/(.*)" replaceWith="$1" sourceColName="fileAbsePath"/>
        <field column="title" xpath="/record/title" />
        <field column="para" xpath="/record/sect1/para" stripHTML="true" />
        <field column="subject" xpath="/record/metadata/subje...@qualifier='fullTitle']" />
        <field column="pubname" xpath="/record/metadata/subje...@qualifier='publication']" />
        <field column="pubdate" xpath="/record/metadata/da...@qualifier='pubDate']" dateTimeFormat="MMdd" />
      </entity>
    </entity>
  </document>
</dataConfig>
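
The fileWebPath field in the config above relies on RegexTransformer's regex and replaceWith attributes, which the DIH documentation describes as equivalent to String.replaceAll on the source column value. A minimal sketch of that rewrite follows. The class WebPathSketch and the file name example.xml are made up for illustration.

```java
// Hypothetical illustration of the regex/replaceWith pair on fileWebPath.
class WebPathSketch {

    // Drops the /Volumes/spare/ts/ prefix by keeping only capture group 1.
    static String toWebPath(String absolutePath) {
        return absolutePath.replaceAll("/Volumes/spare/ts/(.*)", "$1");
    }

    public static void main(String[] args) {
        // example.xml is an invented file name under the configured baseDir.
        String abs = "/Volumes/spare/ts/jxml/data/news/groups/example.xml";
        System.out.println(toWebPath(abs)); // jxml/data/news/groups/example.xml
    }
}
```

A path that does not start with the prefix is returned unchanged, because replaceAll only rewrites matching text.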

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.

2009-01-19 Thread Fergus McMenemie
>This looks fine. Can you post the stack trace?

Yep, here is the juicy bit. Let me know if you need more.

Jan 19, 2009 11:08:03 AM org.apache.catalina.startup.Catalina start
INFO: Server startup in 2390 ms
Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrCore execute
INFO: [janesdocs] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=12
Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [janesdocs] REMOVING ALL DOCUMENTS FROM INDEX
Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=2

commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_1,version=1232363283058,generation=1,filenames=[segments_1]

commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_2,version=1232363283059,generation=2,filenames=[segments_2]
Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: last commit = 1232363283059
Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.EntityProcessorBase applyTransformer
WARNING: transformer threw error
java.lang.NullPointerException
    at java.io.StringReader.<init>(StringReader.java:33)
    at org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
    at org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
    at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DocBuilder buildDocument
SEVERE: Exception while processing: janescurrent document : null
org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
    at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:203)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
Caused by: java.lang.NullPointerException
    at java.io.StringReader.<init>(StringReader.java:33)
    at org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
    at org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
    at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
    ... 9 more
Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DataImporter doFullImport
SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
    at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:203)
    at

Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.

2009-01-19 Thread Fergus McMenemie
Hmmm,

Just to clarify, I retested the thing using the nightly as of today,
18-jan-2009. The problem is still there and this traceback is from
that nightly.


Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.

2009-01-19 Thread Shalin Shekhar Mangar
Ah, it needs a null check for multi-valued fields. I've committed a fix to
trunk. The next nightly build should have it. You can check out and build
from the trunk if you need this immediately.
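
For readers following along, the kind of null check meant here can be sketched as below. This is not the committed Solr patch. NullSafeStripSketch is a hypothetical class, and its regex-based stripTags is a crude stand-in for Solr's real HTML stripping. The point is only that new StringReader(null) throws the NullPointerException seen in the traceback, so null values have to be skipped before stripping, both for single values and for entries of a multi-valued field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the Solr fix: strips tags from a single value or
// from each entry of a multi-valued field, passing nulls through untouched.
class NullSafeStripSketch {

    // Stand-in tag remover (assumption): a simple regex, not Solr's stripper.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    static Object stripValue(Object value) {
        if (value == null) {
            return null; // nothing to strip; avoids the NullPointerException
        }
        if (value instanceof List) {
            List<Object> stripped = new ArrayList<>();
            for (Object item : (List<?>) value) {
                // A multi-valued field may contain null entries.
                stripped.add(item == null ? null : stripTags(item.toString()));
            }
            return stripped;
        }
        return stripTags(value.toString());
    }

    public static void main(String[] args) {
        List<Object> paras = Arrays.<Object>asList("a <b>b</b>", null, "<i>c</i>");
        System.out.println(stripValue(paras));
    }
}
```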
