Thanks a lot, Michael. See replies below.
> On 25.03.2015 at 21:41, Michael Della Bitta
> <michael.della.bi...@appinions.com> wrote:
>
> Two other things I noticed:
>
> 1. You probably don't want to store your copyFields. That's literally going
> to be the same information each time.

OK, got it. I have set the targets of the copy fields to stored="false".

> 2. Your expectation "the pre-processed version of the text is added to the
> index" may be incorrect. Anything done in <analyzer type="query"> sections
> actually happens at query time. Not sure if that's significant for you.

I was actually referring to what happens at index time. So, the pre-processing steps are applied under <analyzer type="index">. And this point is not quite clear to me: assuming that I have a simple case-folding step applied to the target of the copyField, how or where are the lower-case tokens stored if the text isn't added to the index? How is the query supposed to retrieve the lower-case version?
(Sorry if this sounds like a naive question, but I have a feeling that I am missing something really basic here.)

Cheers,

Martin

> Michael Della Bitta
> Senior Software Engineer
> o: +1 646 532 3062
> appinions inc.
> "The Science of Influence Marketing"
> 18 East 41st Street
> New York, NY 10017
> t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions
> <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
> w: appinions.com <http://www.appinions.com/>
>
> On Wed, Mar 25, 2015 at 4:27 PM, Ahmet Arslan <iori...@yahoo.com.invalid>
> wrote:
>
>> Hi Martin,
>>
>> fq means filter query. Maybe you want to use the qf (query fields) parameter
>> of edismax?
>>
>> On Wednesday, March 25, 2015 9:23 PM, Martin Wunderlich <martin...@gmx.net>
>> wrote:
>> Hi all,
>>
>> I am wondering what the process is for applying Tokenizers and Filters (as
>> defined in the FieldType definition) to field contents that result from
>> CopyFields.
>> To be more specific, in my Solr instance, I would like to
>> support query expansion by two means: removing stop words and adding
>> inflected word forms as synonyms.
>>
>> To use a specific example, let's say I have the following sentence to be
>> indexed (from a Wittgenstein manuscript):
>>
>> "Was zum Wesen der Welt gehört, kann die Sprache nicht ausdrücken."
>>
>> This sentence will be indexed in a field called "original" that is defined
>> as follows:
>>
>> <field name="original" type="text_original" indexed="true" stored="true"
>>        required="true"/>
>>
>> <fieldType name="text_windex_original" class="solr.TextField"
>>            positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>> Then, in order to support the two types of query expansion, I have set up
>> a dedicated field for each:
>>
>> - one field where stopwords are removed from both the indexed content and
>> the query. So, if the user is searching for a phrase like "der Sprache",
>> Solr should still find the segment above, because the determiners ("der"
>> and "die") are removed prior to indexing and prior to querying,
>> respectively.
>> This field is defined as follows:
>>
>> <field name="stopwords_removed" type="text_stopwords_removed"
>>        indexed="true" stored="true" required="true"/>
>>
>> <fieldType name="text_stopwords_removed" class="solr.TextField"
>>            positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>             words="stopwords_de.txt" format="snowball"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>             words="stopwords_de.txt" format="snowball"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>> - a second field where synonyms are added to the query so that more
>> segments will be found. For instance, if the user is searching for the
>> plural form "Sprachen", Solr should return the segment above, due to this
>> entry in the synonyms file: "Sprache,Sprach,Sprachen".
>> This field is defined as follows:
>>
>> <field name="expanded" type="text_multiplied" indexed="true" stored="true"
>>        required="true"/>
>>
>> <fieldType name="text_expanded" class="solr.TextField"
>>            positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>             words="stopwords_de.txt" format="snowball"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>             words="stopwords_de.txt" format="snowball"/>
>>     <filter class="solr.SynonymFilterFactory"
>>             synonyms="synonyms_de.txt" ignoreCase="true" expand="true"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>> Finally, to avoid having to specify three fields with identical content in
>> the import documents, I am defining the two fields for query expansion as
>> copyFields:
>>
>> <copyField source="original" dest="stopwords_removed"/>
>> <copyField source="original" dest="expanded"/>
>>
>> Now, my expectation would be as follows:
>> - during import, two temporary fields are created by copying content from
>> the original field
>> - these two temporary fields are then pre-processed as per the definitions
>> above
>> - the pre-processed version of the text is added to the index
>> - then, the user can search for "Sprache", "sprache", "Sprachen" or "der
>> Sprache" and will always get the segment above as a matching result.
>>
>> However, what actually happens is that I get matches only for "Sprache"
>> and "sprache".
>>
>> The other thing that strikes me as odd is that when I restrict the search
>> to just one of the fields using the "fq" parameter, I get no results.
>> For instance:
>>
>> http://localhost:8983/solr/windex/select?q=Sprache&fq=original&wt=json&indent=true
>>
>> will return no matches. I would have expected that, by using the fq
>> parameter, the user can specify what type of search (s)he would like to
>> carry out: a standard search (field "original") or an expanded search (one
>> of the other two fields).
>>
>> For debugging, I have checked the analysis, and the results seem OK (posted
>> below).
>> Apologies for the long post, but I am really a bit stuck here (even after
>> doing a lot of reading and googling). It is probably something simple that
>> I am missing.
>> Thanks a lot in advance for any help.
>>
>> Cheers,
>>
>> Martin
>>
>>
>> ST   Was zum Wesen der Welt gehört kann die Sprache nicht ausdrücken
>> SF   Was zum Wesen Welt gehört kann die Sprache nicht ausdrücken
>> LCF  was zum wesen welt gehört kann die sprache nicht ausdrücken
>>