For posterity, in case anybody follows this thread: I tracked the problem down to WordDelimiterFilter. Apparently it creates an offset of -1 in some cases, which PostingsHighlighter rejects.
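
One way to confirm a diagnosis like this is to index the same documents into a copy of the field type with the WordDelimiterFilter taken out and check whether the highlighter error goes away. A minimal sketch, assuming a hypothetical type name text_en_nowdf and trimmed to the char filter, tokenizer and lowercase filter for brevity (the other filters from the original chain are simply left out here, not ruled in or out):

```xml
<!-- Hypothetical diagnostic copy of text_en; the name text_en_nowdf is an assumption -->
<fieldType name="text_en_nowdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- solr.WordDelimiterFilterFactory deliberately omitted for the test -->
  </analyzer>
</fieldType>
```

If highlighting works against a field of this type after reindexing, the filter that was removed is the likely source of the bad offsets.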

-Mike


On 5/2/2014 10:20 AM, Michael Sokolov wrote:
I checked using the analysis admin page, and I believe there are offsets being generated (I assume start/end=offsets). So IDK I am going to try reindexing again. Maybe I neglected to reload the config before I indexed last time.

-Mike

On 05/02/2014 09:34 AM, Michael Sokolov wrote:
I've been wanting to try out the PostingsHighlighter, so I added storeOffsetsWithPositions to my field definition, enabled the highlighter in solrconfig.xml, reindexed and tried it out. When I issue a query I'm getting this error:

field 'text' was indexed without offsets, cannot highlight

java.lang.IllegalArgumentException: field 'text' was indexed without offsets, cannot highlight
        at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightDoc(PostingsHighlighter.java:545)
        at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightField(PostingsHighlighter.java:467)
        at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightFieldsAsObjects(PostingsHighlighter.java:392)
        at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightFields(PostingsHighlighter.java:293)

I've been trying to figure out why the field wouldn't have offsets indexed, but I just can't see it. Is there something in the analysis chain that could be stripping out offsets?


This is the field definition:

<field name="text" type="text_en" indexed="true" stored="true" multiValued="false" termVectors="true" termPositions="true" termOffsets="true" storeOffsetsWithPositions="true" />

(Yes I know PH doesn't require term vectors; I'm keeping them around for now while I experiment)
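
For reference, a sketch of what the same field would look like once the term-vector attributes are dropped, assuming nothing else in the setup relies on them:

```xml
<!-- Same field as above; offsets are stored in the postings only, term-vector attributes removed -->
<field name="text" type="text_en" indexed="true" stored="true" multiValued="false" storeOffsetsWithPositions="true" />
```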

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- We are indexing mostly HTML so we need to ignore the tags -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!--<tokenizer class="solr.StandardTokenizerFactory"/>-->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- lower casing must happen before WordDelimiterFilter or protwords.txt will not work -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" stemEnglishPossessive="1" protected="protwords.txt"/>
    <!-- This deals with contractions -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" expand="true" ignoreCase="true"/>
    <filter class="solr.HunspellStemFilterFactory" dictionary="en_US.dic" affix="en_US.aff" ignoreCase="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!--<tokenizer class="solr.StandardTokenizerFactory"/>-->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- lower casing must happen before WordDelimiterFilter or protwords.txt will not work -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt"/>
    <!-- setting tokenSeparator="" solves issues with compound words and improves phrase search -->
    <filter class="solr.HunspellStemFilterFactory" dictionary="en_US.dic" affix="en_US.aff" ignoreCase="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

