Applying Tokenizers and Filters to CopyFields

Martin Wunderlich Wed, 25 Mar 2015 12:24:07 -0700

Hi all, 

I am wondering what the process is for applying Tokenizers and Filters (as 
defined in the FieldType definition) to field contents that result from 
CopyFields. To be more specific, in my Solr instance, I would like to support 
query expansion by two means: removing stop words and adding inflected word 
forms as synonyms.


To use a specific example, let's say I have the following sentence to be 
indexed (from a Wittgenstein manuscript): 

"Was zum Wesen der Welt gehört, kann die Sprache nicht ausdrücken."


This sentence will be indexed in a field called "original" that is defined as 
follows: 

<field name="original" type="text_original" indexed="true" stored="true" 
required="true"/>

    <fieldType name="text_windex_original" class="solr.TextField" 
positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
      </analyzer>
    </fieldType>


Then, in order to create fields for the two types of query expansion, I have 
set up specific fields for this: 

- one field where stopwords are removed from both the indexed content and the 
query. So, if the user is searching for a phrase like "der Sprache", Solr 
should still find the segment above, because the determiners ("der" and "die") 
are removed prior to indexing and prior to querying, respectively. This field 
is defined as follows: 

<field name="stopwords_removed" type="text_stopwords_removed" indexed="true" 
stored="true" required="true"/>

    <fieldType name="text_stopwords_removed" class="solr.TextField" 
positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" 
words="stopwords_de.txt" format="snowball"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" 
words="stopwords_de.txt" format="snowball"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>


- a second field where synonyms are added to the query so that more segments 
will be found. For instance, if the user is searching for the plural form 
"Sprachen", Solr should return the segment above, due to this entry in the 
synonyms file: "Sprache,Sprach,Sprachen". This field is defined as follows: 

<field name="expanded" type="text_multiplied" indexed="true" stored="true" 
required="true"/>

    <fieldType name="text_expanded" class="solr.TextField" 
positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" 
words="stopwords_de.txt" format="snowball"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" 
words="stopwords_de.txt" format="snowball"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms_de.txt" 
ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Finally, to avoid having to specify three fields with identical content in the 
import documents, I am defining the two fields for query expansion as 
copyFields: 

  <copyField source="original" dest="stopwords_removed"/>
  <copyField source="original" dest="expanded"/>

Now, my expectation would be as follows: 
- during import, two temporary fields are created by copying content from the 
original field
- these two temporary fields are then pre-processed as per the definitions above
- the pre-processed version of the text is added to the index
- then, the user can search for "Sprache", "sprache", "Sprachen" or "der 
Sprache" and will always get the segment above as a matching result. 
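
To make this expectation concrete, here is a rough Python sketch of the processing I have in mind. It is not Solr code and says nothing about what Solr actually does: the tiny stop word and synonym sets are made-up stand-ins for stopwords_de.txt and synonyms_de.txt, the regex tokenizer only approximates StandardTokenizerFactory, and all function names are hypothetical.

```python
import re

# Hypothetical stand-ins for stopwords_de.txt and synonyms_de.txt;
# the real files contain many more entries.
STOPWORDS = {"der", "die", "das", "zum"}
SYNONYMS = [{"sprache", "sprach", "sprachen"}]


def tokenize(text):
    """Split on non-word characters (a rough approximation of a standard tokenizer)."""
    return re.findall(r"\w+", text)


def remove_stopwords(tokens):
    """Drop stop words, ignoring case (as with ignoreCase="true")."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def lowercase(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def expand_synonyms(tokens):
    """Return the lowercased tokens plus all members of any matching synonym group."""
    expanded = set()
    for t in tokens:
        expanded.add(t.lower())
        for group in SYNONYMS:
            if t.lower() in group:
                expanded |= group
    return expanded


def index_terms(text):
    """Index-time chain I expect for both copied fields."""
    return set(lowercase(remove_stopwords(tokenize(text))))


def query_terms(query, with_synonyms=False):
    """Query-time chain; synonyms are only added for the "expanded" field."""
    tokens = remove_stopwords(tokenize(query))
    if with_synonyms:
        return expand_synonyms(tokens)
    return set(lowercase(tokens))


def matches(query, text, with_synonyms=False):
    """True if at least one query term occurs among the indexed terms."""
    return bool(query_terms(query, with_synonyms) & index_terms(text))


SENTENCE = "Was zum Wesen der Welt gehört, kann die Sprache nicht ausdrücken."
```

Under these assumptions, matches("der Sprache", SENTENCE) and matches("Sprachen", SENTENCE, with_synonyms=True) would both be true, while "Sprachen" without synonym expansion would not match.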

However, what actually happens is that I get matches only for "Sprache" and 
"sprache". 

The other thing that strikes me as odd is that when I restrict the search to one 
of the fields only using the "fq" parameter, I get no results. For instance: 
http://localhost:8983/solr/windex/select?q=Sprache&fq=original&wt=json&indent=true

will return no matches. I would have expected that, using the fq parameter, the 
user can specify what type of search (s)he would like to carry out: a standard 
search (field original) or an expanded search (one of the other two fields). 

For debugging, I have checked the analysis and the results seem ok (posted below). 
Apologies for the long post, but I am really a bit stuck here (even after doing 
a lot of reading and googling). It is probably something simple that I am missing. 
Thanks a lot in advance for any help. 

Cheers, 

Martin
 

ST:  Was | zum | Wesen | der | Welt | gehört | kann | die | Sprache | nicht | ausdrücken
SF:  Was | zum | Wesen |     | Welt | gehört | kann | die | Sprache | nicht | ausdrücken
LCF: was | zum | wesen |     | welt | gehört | kann | die | sprache | nicht | ausdrücken
