Re: Applying Tokenizers and Filters to CopyFields

Michael Della Bitta Wed, 25 Mar 2015 13:45:15 -0700

Two other things I noticed:

1. You probably don't want to store your copyFields. That's literally going
to be the same information each time.

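As a rough sketch of point 1, the destination fields from the quoted schema could be declared with stored="false". A copyField copies the raw input value before analysis, so the stored value of each destination would be identical to the stored value of "original". The type name text_expanded for the third field is an assumption; the quoted schema references text_multiplied in the field but defines text_expanded as the fieldType.

```xml
<!-- Source field: stays stored so the original text can be returned in results. -->
<field name="original" type="text_original" indexed="true" stored="true" required="true"/>

<!-- copyField destinations: indexed for searching, but not stored,
     because the stored value would only duplicate "original". -->
<field name="stopwords_removed" type="text_stopwords_removed" indexed="true" stored="false" required="true"/>
<!-- Assumption: type name taken from the fieldType definition, not from the quoted field. -->
<field name="expanded" type="text_expanded" indexed="true" stored="false" required="true"/>

<copyField source="original" dest="stopwords_removed"/>
<copyField source="original" dest="expanded"/>
```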

2. Your expectation "the pre-processed version of the text is added to the
index" may be incorrect. Anything done in <analyzer type="query"> sections
actually happens at query time. Not sure if that's significant for you.

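To illustrate point 2: if the intent is for the expanded word forms to end up in the index, one option (not the only one) is to move the synonym filter into the index-time analyzer. The fieldType name text_expanded_at_index below is hypothetical; the file names are the ones from the quoted schema.

```xml
<fieldType name="text_expanded_at_index" class="solr.TextField" positionIncrementGap="100">
  <!-- Runs once per document when it is indexed; its output is what is written to the index. -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" format="snowball"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_de.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Runs on the query string at search time; it never changes what is stored in the index. -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" format="snowball"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Documents already in the index would need to be reindexed after a change to the index-time analyzer.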

Michael Della Bitta


On Wed, Mar 25, 2015 at 4:27 PM, Ahmet Arslan <iori...@yahoo.com.invalid>
wrote:

> Hi Martin,
>
> fq means filter query. Maybe you want to use the qf (query fields) parameter
> of edismax?
>
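A minimal sketch of the qf suggestion, assuming it is set as a handler default in solrconfig.xml. The handler name /select_expanded is hypothetical; the field names are the ones from the schema quoted below.

```xml
<requestHandler name="/select_expanded" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Use the edismax query parser so that qf is honoured. -->
    <str name="defType">edismax</str>
    <!-- Search all three fields; pass a different qf per request to restrict the search. -->
    <str name="qf">original stopwords_removed expanded</str>
  </lst>
</requestHandler>
```

The same parameters can also be passed on the request itself, for example q=Sprachen&defType=edismax&qf=expanded. By contrast, fq takes a complete query such as fq=original:Sprache, not a bare field name.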
>
>
> On Wednesday, March 25, 2015 9:23 PM, Martin Wunderlich <martin...@gmx.net>
> wrote:
> Hi all,
>
> I am wondering what the process is for applying Tokenizers and Filters (as
> defined in the FieldType definition) to field contents that result from
> CopyFields. To be more specific, in my Solr instance, I would like to
> support query expansion by two means: removing stop words and adding
> inflected word forms as synonyms.
>
> To use a specific example, let’s say I have the following sentence to be
> indexed (from a Wittgenstein manuscript):
>
> "Was zum Wesen der Welt gehört, kann die Sprache nicht ausdrücken.“
>
>
> This sentence will be indexed in a field called „original“ that is defined
> as follows:
>
> <field name="original" type="text_original" indexed="true" stored="true"
> required="true"/>
>
>     <fieldType name="text_windex_original" class="solr.TextField"
> positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>       </analyzer>
>     </fieldType>
>
>
> Then, in order to create fields for the two types of query expansion, I
> have set up specific fields for this:
>
> - one field where stopwords are removed both on the indexed content and
> the query. So, if the user is searching for a phrase like „der Sprache“,
> Solr should still find the segment above, because the determiners („der“
> and „die“) are removed prior to indexing and prior to querying,
> respectively. This field is defined as follows:
>
> <field name="stopwords_removed" type="text_stopwords_removed"
> indexed="true" stored="true" required="true"/>
>
>     <fieldType name="text_stopwords_removed" class="solr.TextField"
> positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords_de.txt" format="snowball"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords_de.txt" format="snowball"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldType>
>
>
> - a second field where synonyms are added to the query so that more
> segments will be found. For instance, if the user is searching for the
> plural form „Sprachen“, Solr should return the segment above, due to this
> entry in the synonyms file: "Sprache,Sprach,Sprachen“. This field is
> defined as follows:
>
> <field name="expanded" type="text_multiplied" indexed="true" stored="true"
> required="true"/>
>
>     <fieldType name="text_expanded" class="solr.TextField"
> positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords_de.txt" format="snowball"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords_de.txt" format="snowball"/>
>         <filter class="solr.SynonymFilterFactory"
> synonyms="synonyms_de.txt" ignoreCase="true" expand="true"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldType>
>
> Finally, to avoid having to specify three fields with identical content in
> the import documents, I am defining the two fields for query expansion as
> copyFields:
>
>   <copyField source="original" dest="stopwords_removed"/>
>   <copyField source="original" dest="expanded"/>
>
> Now, my expectation would be as follows:
> - during import, two temporary fields are created by copying content from
> the original field
> - these two temporary fields are then pre-processed as per the definitions
> above
> - the pre-processed version of the text is added to the index
> - then, the user can search for „Sprache“, „sprache“, „Sprachen“ or „der
> Sprache“ and will always get the segment above as a matching result.
>
> However, what happens actually is that I get matches only for „Sprache“
> and „sprache“.
>
> The other thing that strikes me as odd is that when I restrict the search
> to one of the fields only, using the „fq“ parameter, I get no results. For
> instance:
>
> http://localhost:8983/solr/windex/select?q=Sprache&fq=original&wt=json&indent=true
>
> will return no matches. I would have expected that, using the fq parameter,
> the user can specify what type of search (s)he would like to carry out: a
> standard search (field original) or an expanded search (one of the other
> two fields).
>
> For debugging, I have checked the analysis and results seem ok (posted
> below).
> Apologies for the long post, but I am really a bit stuck here (even after
> doing a lot of reading and googling). It is probably something simple that
> I am missing.
> Thanks a lot in advance for any help.
>
> Cheers,
>
> Martin
>
>
> ST:  Was | zum | Wesen | | der | Welt | gehört | kann | die | Sprache | nicht | ausdrücken
> SF:  Was | zum | Wesen | | Welt | gehört | kann | die | Sprache | nicht | ausdrücken
> LCF: was | zum | wesen | | welt | gehört | kann | die | sprache | nicht | ausdrücken
>
