During indexing I often get this error:

SEVERE: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 3))
 at [row,col {unknown-source}]: [2,1]
    at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:675)
From this list and elsewhere I know that I need to filter out most control characters, so I have been using this regex:

/[\x00-\x08\x0B\x0C\x0E-\x1F]/

But I still get the error.

What is strange is that if I re-run my indexing process after a failure, it succeeds on the node that previously failed and then errors out on a different node some time later. That is, the failure is not deterministic. The text being indexed is about as clean as it gets (a bunch of medical keywords like "leg bones" and "nose").

Any ideas would be greatly appreciated.

The platform is:

Solr implementation version: 1.3.0 694707
Lucene implementation version: 2.4-dev 691741
Mac OS X 10.5.7
JVM 1.5.0_19-b02-304

Thanks
/Rupert
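
For reference, the filtering step described above can be sketched roughly as follows. This is only an illustration in Python, not the code actually used by the indexing process: the function names are hypothetical, and the character class is the regex from the question extended (as an assumption) with the surrogates and U+FFFE/U+FFFF, which XML 1.0 also disallows. The second helper reports where offending characters sit, which may help check whether code 3 is really present in the text before it is sent.

```python
import re

# Characters XML 1.0 does not allow: C0 controls other than tab (0x09),
# LF (0x0A) and CR (0x0D), plus surrogates and U+FFFE / U+FFFF.
# The first part is the regex from the question; the rest is an
# assumed extension, not part of the original filter.
INVALID_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def strip_invalid_xml_chars(text):
    """Return text with every XML-invalid character removed (hypothetical helper)."""
    return INVALID_XML_CHARS.sub("", text)


def find_invalid_xml_chars(text):
    """Return (index, code point) pairs for each XML-invalid character (hypothetical helper)."""
    return [(m.start(), ord(m.group())) for m in INVALID_XML_CHARS.finditer(text)]


if __name__ == "__main__":
    sample = "leg\x03 bones\tnose\n"
    print(find_invalid_xml_chars(sample))
    print(repr(strip_invalid_xml_chars(sample)))
```

If the reporting helper finds nothing in the text of a node that later fails, the control character is presumably being introduced somewhere after the filter runs rather than coming from the source text.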