Re: Applying Tokenizers and Filters to CopyFields

Martin Wunderlich Wed, 25 Mar 2015 14:29:18 -0700

Thanks a lot, Michael. See replies below.


> On 25.03.2015 at 21:41, Michael Della Bitta 
> <michael.della.bi...@appinions.com> wrote:
> 
> Two other things I noticed:
> 
> 1. You probably don't want to store your copyFields. That's literally going
> to be the same information each time.

OK, got it. I have set the targets of the copy fields to stored="false". 
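
For reference, a sketch of what the two copyField targets look like after that change. Everything except the stored attribute is copied unchanged from the definitions in my original post quoted below, so treat it as an illustration rather than a verified schema:

```xml
<!-- copyField targets: still indexed, but no longer stored -->
<field name="stopwords_removed" type="text_stopwords_removed"
       indexed="true" stored="false" required="true"/>
<field name="expanded" type="text_multiplied"
       indexed="true" stored="false" required="true"/>
```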

> 
> 2. Your expectation "the pre-processed version of the text is added to the
> index" may be incorrect. Anything done in <analyzer type="query"> sections
> actually happens at query time. Not sure if that's significant for you.

I was actually referring to what happens at index time, so the 
pre-processing steps are applied under <analyzer type="index">. One point 
is still not quite clear to me: assuming that a simple case-folding step is 
applied to the target of the copyField, how or where are the lower-case tokens 
stored if the text isn't added to the index? How is the query supposed to 
retrieve the lower-case version? 
(Sorry if this sounds like a naive question, but I have a feeling that I am 
missing something really basic here.) 
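
To make the question concrete, here is a minimal sketch of the kind of case-folding setup I mean. The names text_lowercased and lowercased are hypothetical and only for illustration; they are not part of my actual schema:

```xml
<!-- Hypothetical type: tokenize, then fold case at index and query time -->
<fieldType name="text_lowercased" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical copyField target: indexed but not stored -->
<field name="lowercased" type="text_lowercased" indexed="true" stored="false"/>
<copyField source="original" dest="lowercased"/>
```

With a field like this, which is indexed but not stored, my question is where tokens such as "sprache" end up.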

Cheers, 

Martin
 

> 
> 
> Michael Della Bitta
> 
> On Wed, Mar 25, 2015 at 4:27 PM, Ahmet Arslan <iori...@yahoo.com.invalid>
> wrote:
> 
>> Hi Martin,
>> 
>> fq means filter query. Maybe you want to use the qf (query fields) parameter
>> of edismax?
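>> 
>> A minimal sketch of what that could look like, assuming the stock /select
>> handler in solrconfig.xml and the three field names from your post. The
>> defaults shown are an illustration, not taken from your configuration:
>> 
>> ```xml
>> <requestHandler name="/select" class="solr.SearchHandler">
>>   <lst name="defaults">
>>     <!-- assumption: make edismax the default query parser -->
>>     <str name="defType">edismax</str>
>>     <!-- fields that the q parameter is searched against -->
>>     <str name="qf">original stopwords_removed expanded</str>
>>   </lst>
>> </requestHandler>
>> ```
>> 
>> The same parameters can also be passed per request (for example
>> defType=edismax and qf=expanded), while fq takes a complete filter query
>> such as fq=original:Sprache.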
>> 
>> 
>> 
>> On Wednesday, March 25, 2015 9:23 PM, Martin Wunderlich <martin...@gmx.net>
>> wrote:
>> Hi all,
>> 
>> I am wondering what the process is for applying Tokenizers and Filters (as
>> defined in the FieldType definition) to field contents that result from
>> CopyFields. To be more specific, in my Solr instance, I would like to
>> support query expansion by two means: removing stop words and adding
>> inflected word forms as synonyms.
>> 
>> To use a specific example, let’s say I have the following sentence to be
>> indexed (from a Wittgenstein manuscript):
>> 
>> "Was zum Wesen der Welt gehört, kann die Sprache nicht ausdrücken."
>> 
>> 
>> This sentence will be indexed in a field called "original" that is defined
>> as follows:
>> 
>> <field name="original" type="text_original" indexed="true" stored="true"
>> required="true"/>
>> 
>>    <fieldType name="text_windex_original" class="solr.TextField"
>> positionIncrementGap="100">
>>      <analyzer type="index">
>>        <tokenizer class="solr.StandardTokenizerFactory"/>
>>      </analyzer>
>>      <analyzer type="query">
>>        <tokenizer class="solr.StandardTokenizerFactory"/>
>>      </analyzer>
>>    </fieldType>
>> 
>> 
>> Then, in order to create fields for the two types of query expansion, I
>> have set up specific fields for this:
>> 
>> - one field where stopwords are removed both from the indexed content and
>> from the query. So, if the user is searching for a phrase like "der Sprache",
>> Solr should still find the segment above, because the determiners ("der"
>> and "die") are removed prior to indexing and prior to querying,
>> respectively. This field is defined as follows:
>> 
>> <field name="stopwords_removed" type="text_stopwords_removed"
>> indexed="true" stored="true" required="true"/>
>> 
>>    <fieldType name="text_stopwords_removed" class="solr.TextField"
>> positionIncrementGap="100">
>>      <analyzer type="index">
>>        <tokenizer class="solr.StandardTokenizerFactory"/>
>>        <filter class="solr.StopFilterFactory" ignoreCase="true"
>> words="stopwords_de.txt" format="snowball"/>
>>        <filter class="solr.LowerCaseFilterFactory"/>
>>      </analyzer>
>>      <analyzer type="query">
>>        <tokenizer class="solr.StandardTokenizerFactory"/>
>>        <filter class="solr.StopFilterFactory" ignoreCase="true"
>> words="stopwords_de.txt" format="snowball"/>
>>        <filter class="solr.LowerCaseFilterFactory"/>
>>      </analyzer>
>>    </fieldType>
>> 
>> 
>> - a second field where synonyms are added to the query so that more
>> segments will be found. For instance, if the user is searching for the
>> plural form "Sprachen", Solr should return the segment above, due to this
>> entry in the synonyms file: "Sprache,Sprach,Sprachen". This field is
>> defined as follows:
>> 
>> <field name="expanded" type="text_multiplied" indexed="true" stored="true"
>> required="true"/>
>> 
>>    <fieldType name="text_expanded" class="solr.TextField"
>> positionIncrementGap="100">
>>      <analyzer type="index">
>>        <tokenizer class="solr.StandardTokenizerFactory"/>
>>        <filter class="solr.StopFilterFactory" ignoreCase="true"
>> words="stopwords_de.txt" format="snowball"/>
>>        <filter class="solr.LowerCaseFilterFactory"/>
>>      </analyzer>
>>      <analyzer type="query">
>>        <tokenizer class="solr.StandardTokenizerFactory"/>
>>        <filter class="solr.StopFilterFactory" ignoreCase="true"
>> words="stopwords_de.txt" format="snowball"/>
>>        <filter class="solr.SynonymFilterFactory"
>> synonyms="synonyms_de.txt" ignoreCase="true" expand="true"/>
>>        <filter class="solr.LowerCaseFilterFactory"/>
>>      </analyzer>
>>    </fieldType>
>> 
>> Finally, to avoid having to specify three fields with identical content in
>> the import documents, I am defining the two fields for query expansion as
>> copyFields:
>> 
>>  <copyField source="original" dest="stopwords_removed"/>
>>  <copyField source="original" dest="expanded"/>
>> 
>> Now, my expectation would be as follows:
>> - during import, two temporary fields are created by copying content from
>> the original field
>> - these two temporary fields are then pre-processed as per the definitions
>> above
>> - the pre-processed version of the text is added to the index
>> - then, the user can search for "Sprache", "sprache", "Sprachen" or "der
>> Sprache" and will always get the segment above as a matching result.
>> 
>> However, what actually happens is that I get matches only for "Sprache"
>> and "sprache".
>> 
>> The other thing that strikes me as odd is that when I restrict the search to
>> only one of the fields using the "fq" parameter, I get no results. For
>> instance:
>> 
>> http://localhost:8983/solr/windex/select?q=Sprache&fq=original&wt=json&indent=true
>> 
>> will return no matches. I would have expected that, using the fq parameter,
>> the user can specify what type of search (s)he would like to carry out: a
>> standard search (field original) or an expanded search (one of the other
>> two fields).
>> 
>> For debugging, I have checked the analysis and results seem ok (posted
>> below).
>> Apologies for the long post, but I am really a bit stuck here (even after
>> doing a lot of reading and googling). It is probably something simple that
>> I am missing.
>> Thanks a lot in advance for any help.
>> 
>> Cheers,
>> 
>> Martin
>> 
>> 
>> ST:  Was zum Wesen der Welt gehört kann die Sprache nicht ausdrücken
>> SF:  Was zum Wesen     Welt gehört kann die Sprache nicht ausdrücken
>> LCF: was zum wesen     welt gehört kann die sprache nicht ausdrücken
>>
