Problem with PatternReplaceCharFilter

jasimop Wed, 29 May 2013 13:13:18 -0700

Hi,

I have a problem using PatternReplaceCharFilter when indexing a field.
I created the following field type:
    <fieldType name="testfield" class="solr.TextField">
      <analyzer type="index">
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="&#60;TextDocument[^&#62;]*&#62;" replacement=""/>
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="&#60;/TextDocument&#62;" replacement=""/>
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="&#60;TextLine[^&#60;]+ content=\&#34;([^\&#34;]*)\&#34;[^/]+/&#62;"
                    replacement="$1 "/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="lang/stopwords_de.txt" format="snowball"
                enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>


I also created a field of this type that is indexed and stored:
<field name="testfield" type="testfield" indexed="true" stored="true" />

I need to index documents with the following structure in this field:
<TextDocument filename="somefile.end" mime="..." created="..."><TextLine
aa="bb" cc="dd" content="the content to search in" ee="ff" /><TextLine
aa="bb" cc="dd" content="the second content line" ee="ff" /></TextDocument>

Basically I have some sort of XML structure. I only need to search in the
"content" attribute, but when highlighting I need to get back to the
enclosing XML tags.

With the three regexes I want to remove all unwanted markup so that only
the values of the "content" attributes are tokenized and indexed.
I know that I could use HTMLStripCharFilterFactory, but then the tag names,
attribute names and attribute values would also get indexed, and I don't
want to search in that content.
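
To show what I expect the three regexes from the field type to produce,
here is a minimal sketch in plain Java. It only uses java.util.regex and is
not Solr or Lucene code; the class name and the strip() helper are made up
for this illustration, and I am assuming that applying the patterns one
after the other behaves like the charFilter chain.

```java
import java.util.regex.Pattern;

final class CharFilterRegexSketch {
    // The three patterns from the field type, with the XML entities decoded.
    // In the config the quote is written as \" ; a bare " matches the same text.
    static final Pattern OPEN_DOC  = Pattern.compile("<TextDocument[^>]*>");
    static final Pattern CLOSE_DOC = Pattern.compile("</TextDocument>");
    static final Pattern TEXT_LINE =
        Pattern.compile("<TextLine[^<]+ content=\"([^\"]*)\"[^/]+/>");

    // Sample document from this mail, joined onto one line.
    static final String SAMPLE =
        "<TextDocument filename=\"somefile.end\" mime=\"...\" created=\"...\">"
        + "<TextLine aa=\"bb\" cc=\"dd\" content=\"the content to search in\" ee=\"ff\" />"
        + "<TextLine aa=\"bb\" cc=\"dd\" content=\"the second content line\" ee=\"ff\" />"
        + "</TextDocument>";

    /** Applies the three replacements in the same order as the charFilter chain. */
    static String strip(String input) {
        String s = OPEN_DOC.matcher(input).replaceAll("");
        s = CLOSE_DOC.matcher(s).replaceAll("");
        return TEXT_LINE.matcher(s).replaceAll("$1 ");
    }

    public static void main(String[] args) {
        // Brackets make the trailing space from the "$1 " replacement visible.
        System.out.println("[" + strip(SAMPLE) + "]");
    }
}
```

The filtered text is much shorter than the stored value, so every character
after the first removed tag sits at a different offset than in the original.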

I read the following in the doc:
NOTE: If you produce a phrase that has different length to source string and
the field is used for highlighting for a term of the phrase, you will face a
trouble. 

What I don't understand is why this is a problem. When I run the analysis
from the Solr admin UI, the CharFilters generate
"the content to search in the second content line", which looks perfect, but
the StandardTokenizer then gets the start and end offsets of the tokens
wrong.
Is there another solution to my problem?
Could I use the following method, which I saw in the documentation of
PatternReplaceCharFilter?
protected int correct(int currentOff) Documentation: Retrieve the corrected
offset.
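
To make clear which offsets I expect for highlighting, here is a small
sketch in plain Java that computes, for every word inside a "content"
attribute, its start and end offset in the original stored string. This is
not Lucene code and says nothing about how PatternReplaceCharFilter corrects
offsets internally. The class and method names are made up, and splitting on
whitespace is only a stand-in for the StandardTokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExpectedOffsetsSketch {
    // Third pattern from the field type, with the XML entities decoded.
    static final Pattern TEXT_LINE =
        Pattern.compile("<TextLine[^<]+ content=\"([^\"]*)\"[^/]+/>");
    // Assumption: whitespace-separated words stand in for tokenizer output.
    static final Pattern WORD = Pattern.compile("\\S+");

    // Sample document from this mail, joined onto one line.
    static final String SAMPLE =
        "<TextDocument filename=\"somefile.end\" mime=\"...\" created=\"...\">"
        + "<TextLine aa=\"bb\" cc=\"dd\" content=\"the content to search in\" ee=\"ff\" />"
        + "<TextLine aa=\"bb\" cc=\"dd\" content=\"the second content line\" ee=\"ff\" />"
        + "</TextDocument>";

    /** Returns {start, end} pairs that index into the ORIGINAL string. */
    static List<int[]> expectedSpans(String original) {
        List<int[]> spans = new ArrayList<>();
        Matcher line = TEXT_LINE.matcher(original);
        while (line.find()) {
            // Offset in the original where the content value begins.
            int base = line.start(1);
            Matcher word = WORD.matcher(line.group(1));
            while (word.find()) {
                spans.add(new int[] {base + word.start(), base + word.end()});
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        // Print each word together with its half-open offset range.
        for (int[] span : expectedSpans(SAMPLE)) {
            System.out.println(SAMPLE.substring(span[0], span[1])
                + " [" + span[0] + ", " + span[1] + ")");
        }
    }
}
```

With offsets like these the highlighter could mark the matching word inside
the stored XML, which is the behaviour I am trying to get.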

How could I solve such a task?






--
View this message in context: 
http://lucene.472066.n3.nabble.com/Problem-with-PatternReplaceCharFilter-tp4066869.html
Sent from the Solr - User mailing list archive at Nabble.com.
