Re: Schema.xml, copyField, Slash, ignoreCase ?

Steve Rowe Fri, 11 Jan 2019 08:43:11 -0800

Hi Bruno,

ignoreCase: It looks like you have already achieved this, since every field 
type you posted includes LowerCaseFilterFactory.


auto truncation: This is stemming, caused by the inclusion of 
PorterStemFilterFactory in your "text_en" field type.  If you don't want its 
effects (i.e. treating different forms of the same word interchangeably), 
remove the filter from both the index and query analyzers and reindex.
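
As a rough, untested sketch, this is what your posted text_en might look like 
with only the stemmer taken out.  Everything else is copied from your 
definition; the comments are mine.

```xml
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <!-- KeywordMarkerFilterFactory kept as posted; with no stemmer after it,
         it has nothing left to protect terms from and could probably go. -->
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <!-- PorterStemFilterFactory removed here -->
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <!-- PorterStemFilterFactory removed here as well, so that index and
         query terms stay comparable -->
  </analyzer>
</fieldType>
```

With this in place, title:airbag should stop matching "airbags", unless your 
synonyms.txt maps one to the other.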

process slash char: I think you want the slash to be included in symbol terms 
rather than interpreted as a term separator.  One way to achieve this is to 
convert the slash, before tokenization, to a string that does not include a 
term separator, and then, after tokenization, to convert the substituted 
string back to a slash.

Here's a version of your text_en that does this in two steps.  First, 
PatternReplaceCharFilterFactory[1] converts slashes inside symbol-ish terms 
to "_", a string that is unlikely to occur otherwise and that 
StandardTokenizer does not interpret as a term separator.  (The pattern is a 
guess based on the symbol text you've provided; you'll likely need to adjust 
it.)  Then PatternReplaceFilterFactory[1] converts "_" back to a slash.  Note 
that the two patterns differ slightly: the *char filter* is given the entire 
field text as input, so its pattern uses \b word boundaries, while the 
*filter* is given the text of a single term, so its pattern is anchored with 
^ and $.

----- 
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\b([A-Za-z]\d+[A-Za-z]\d+)/(\d+)\b" 
                replacement="$1_$2"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" 
            pattern="^([A-Za-z]\d+[A-Za-z]\d+)_(\d+)$" 
            replacement="$1/$2"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" 
            words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\b([A-Za-z]\d+[A-Za-z]\d+)/(\d+)\b" 
                replacement="$1_$2"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" 
            pattern="^([A-Za-z]\d+[A-Za-z]\d+)_(\d+)$" 
            replacement="$1/$2"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" 
            words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
-----
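
One thing to watch: in the schema you posted, the "text" and "ti" fields use 
text_general rather than text_en, so queries against "text" would presumably 
only pick up the slash handling if the same two steps are added there too.  
A sketch of that, untested and assuming the same guessed pattern; everything 
other than the charFilter and the PatternReplaceFilterFactory lines is copied 
from your text_general definition:

```xml
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <!-- before tokenization: A62C21/02 becomes A62C21_02 -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\b([A-Za-z]\d+[A-Za-z]\d+)/(\d+)\b"
                replacement="$1_$2"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- after tokenization: the single term A62C21_02 becomes A62C21/02 -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^([A-Za-z]\d+[A-Za-z]\d+)_(\d+)$"
            replacement="$1/$2"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\b([A-Za-z]\d+[A-Za-z]\d+)/(\d+)\b"
                replacement="$1_$2"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^([A-Za-z]\d+[A-Za-z]\d+)_(\d+)$"
            replacement="$1/$2"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Since the index-time analysis changes, the documents would need to be 
reindexed before queries see any difference.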

[1] http://archive.apache.org/dist/lucene/solr/ref-guide/apache-solr-ref-guide-5.4.pdf

--
Steve


> On Jan 11, 2019, at 4:18 AM, Bruno Mannina <bmann...@matheo-software.com> 
> wrote:
> 
> I need to have default text field with:
> 
> - ignoreCase,
> 
> - no auto truncation,
> 
> - process slash char
> 
> 
> 
> I would like to perform only query on the field text
> 
> Queries can contain:  code or keywords or both.
> 
> 
> 
> I have 2 fields named symbol and title, and 1 alias ti (old field that I
> can't delete or modify)
> 
> 
> 
> * Symbol contains code with slash (e.g. A62C21/02)
> 
> <field name="symbol" type="string_ci" multiValued="false" indexed="true"
> required="true" stored="true"/>
> 
> 
> 
> * Title contains English text and also symbol
> 
>    <field name="title" type="text_en" multiValued="true" indexed="true"
> stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
> 
> 
> 
> { "symbol": "B65D81/20",
> 
> "title": [
> 
> "under vacuum or superatmospheric pressure, or in a special atmosphere,
> e.g. of inert gas  {(B65D81/28  takes precedence; containers with
> pressurising means for maintaining ball pressure A63B39/025)} "
> 
> ]}
> 
> 
> 
> * Ti is an alias of title
> 
>    <field name="ti" type="text_general" multiValued="true" indexed="true"
> stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
> 
> 
> 
> * Text is
> 
> <field name="text" type="text_general" indexed="true" stored="false"
> multiValued="true"/>
> 
> 
> 
> - Aliases are:
> 
> 
> 
>    <copyField source="title"  dest="ti"/>
> 
>    <!-- ALIAS TEXT -->
> 
>    <copyField source="title"  dest="text"/>
> 
>    <copyField source="symbol" dest="text"/>
> 
> 
> 
> 
> 
> If I do these queries :
> 
> 
> 
> * ti:airbag                           -> it's ok
> 
> * title:airbag                      -> not good for me because it found
> airbags
> 
> * ti:b65D81/28                  -> not good, debug shows ti:b65d81 OR ti:28
> 
> * ti:b65D81/28              -> it's ok
> 
> * symbol:b65D81/28      -> it's ok (even without  )
> 
> 
> 
> NOW with text field
> 
> * b65D81/28                      -> not good, debug shows text:b65d81 OR
> text:28
> 
> * airbag                               -> it's ok
> 
> * b65D81/28                  -> it's ok
> 
> 
> 
> It will be great if I can enter symbol without  
> 
> 
> 
> Could you help me to have a text field which solves this problem? (please
> find below all def of my fields)
> 
> 
> 
> Many thanks for your help.
> 
> 
> 
> String_ci is my own definition
> 
> 
> 
>    <fieldType name="string_ci" class="solr.TextField"
> sortMissingLast="true" omitNorms="true">
> 
>    <analyzer>
> 
>      <tokenizer class="solr.KeywordTokenizerFactory"/>
> 
>      <filter class="solr.LowerCaseFilterFactory"/>
> 
>    </analyzer>
> 
>    </fieldType>
> 
> 
> 
>    <fieldType name="text_general" class="solr.TextField"
> positionIncrementGap="100" multiValued="true">
> 
>      <analyzer type="index">
> 
>        <tokenizer class="solr.StandardTokenizerFactory"/>
> 
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" />
> 
>        <filter class="solr.LowerCaseFilterFactory"/>
> 
>      </analyzer>
> 
>      <analyzer type="query">
> 
>        <tokenizer class="solr.StandardTokenizerFactory"/>
> 
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" />
> 
>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
> 
>        <filter class="solr.LowerCaseFilterFactory"/>
> 
>      </analyzer>
> 
>    </fieldType>
> 
> 
> 
>    <fieldType name="text_en" class="solr.TextField"
> positionIncrementGap="100">
> 
>      <analyzer type="index">
> 
>        <tokenizer class="solr.StandardTokenizerFactory"/>
> 
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="lang/stopwords_en.txt"/>
> 
>        <filter class="solr.LowerCaseFilterFactory"/>
> 
>        <filter class="solr.EnglishPossessiveFilterFactory"/>
> 
>        <filter class="solr.KeywordMarkerFilterFactory"
> protected="protwords.txt"/>
> 
>        <filter class="solr.PorterStemFilterFactory"/>
> 
>      </analyzer>
> 
>      <analyzer type="query">
> 
>        <tokenizer class="solr.StandardTokenizerFactory"/>
> 
>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
> 
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="lang/stopwords_en.txt"/>
> 
>        <filter class="solr.LowerCaseFilterFactory"/>
> 
>        <filter class="solr.EnglishPossessiveFilterFactory"/>
> 
>       <filter class="solr.KeywordMarkerFilterFactory"
> protected="protwords.txt"/>
> 
>        <filter class="solr.PorterStemFilterFactory"/>
> 
>      </analyzer>
> 
>    </fieldType>
> 
> 
> 
> 
> 
> Best Regards
> 
> Bruno
> 
> 
> 
> 
> 
