Hi,

I have a problem with PatternReplaceCharFilter when indexing a field. I created the following field type:

<fieldType name="testfield" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="&lt;TextDocument[^>]*>" replacement=""/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="&lt;/TextDocument>" replacement=""/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="&lt;TextLine[^&lt;]+ content=&quot;([^&quot;]*)&quot;[^/]+/>" replacement="$1 "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

I also created a field of that type that is indexed and stored:

<field name="testfield" type="testfield" indexed="true" stored="true"/>

I need to index documents with the following structure in this field:

<TextDocument filename="somefile.end" mime="..." created="..."><TextLine aa="bb" cc="dd" content="the content to search in" ee="ff" /><TextLine aa="bb" cc="dd" content="the second content line" ee="ff" /></TextDocument>

So I have an XML-like structure and only want to search in the "content" attribute, but when highlighting I need to get back to the enclosing XML tags. The three regexes are meant to remove all unwanted markup so that only the important data is tokenized and indexed.

I know that I could use HTMLStripCharFilterFactory, but then the tag names, attribute names and the other attribute values would get indexed as well, and I don't want to search in those.

I read the following in the documentation:

NOTE: If you produce a phrase that has different length to source string and the field is used for highlighting for a term of the phrase, you will face a trouble.

Why is this the case? When I run the analysis from the Solr admin UI, the char filters generate "the content to search in the second content line", which looks perfect, but the StandardTokenizer then reports the wrong start and end offsets for the tokens.

Is there another solution to my problem? Could I use the following method, which I saw in the documentation of PatternReplaceCharFilter?

protected int correct(int currentOff)

Its documentation says: "Retrieve the corrected offset."

How could I solve such a task?
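
To make the offset question concrete, here is a minimal sketch in plain Python (not Solr or Lucene code, and not how Solr computes anything). It applies the three patterns from the field type to the sample document and records, for each kept "content" value, where it starts in the original string and where it starts in the filtered string. The variable names (doc, filters, filtered, offsets) are made up for this illustration, and it assumes the document is indexed on a single line exactly as shown above.

```python
import re

# Sample document from the post, kept on one line.
doc = ('<TextDocument filename="somefile.end" mime="..." created="...">'
       '<TextLine aa="bb" cc="dd" content="the content to search in" ee="ff" />'
       '<TextLine aa="bb" cc="dd" content="the second content line" ee="ff" />'
       '</TextDocument>')

# The three patterns from the field type, applied in the same order.
# Python writes the group reference as \1 where the Solr config uses $1.
filters = [
    (r'<TextDocument[^>]*>', ''),
    (r'</TextDocument>', ''),
    (r'<TextLine[^<]+ content="([^"]*)"[^/]+/>', r'\1 '),
]

filtered = doc
for pattern, replacement in filters:
    filtered = re.sub(pattern, replacement, filtered)

# For each kept "content" value, record its start offset in the
# original document and its start offset in the filtered text.
line_pattern = re.compile(filters[2][0])
offsets = []
search_from = 0
for m in line_pattern.finditer(doc):
    value = m.group(1)
    filtered_start = filtered.index(value, search_from)
    offsets.append((value, m.start(1), filtered_start))
    search_from = filtered_start + len(value)

print(repr(filtered))
for value, original_start, filtered_start in offsets:
    print(f'{value!r}: original offset {original_start}, '
          f'filtered offset {filtered_start}')
```

The filtered text matches what the analysis screen shows, but every value starts much later in the stored original than in the filtered text. Highlighting works on the stored original, so it needs the offsets from the first column, which is the kind of mapping an offset-correction step would have to supply.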