Re: edge ngram/find as you type sorting

Erick Erickson Wed, 25 Mar 2020 10:16:19 -0700

What _is_ happening? Please provide examples of the inputs
and outputs that don’t work for you. ‘cause
the sort order should be “nothing comes before something”
so sorting ascending on a keywordtokenizer+lowercasefilter
should give you exactly what you’re asking for with no
need for a length field.
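
To make that concrete, here is a minimal Python sketch of the ordering I mean. It is not Solr code: the sample slugs and the find_as_you_type helper are made up for illustration, and it assumes plain lowercase ASCII values so that Python's string comparison lines up with how the indexed terms would compare.

```python
# Hypothetical slugs standing in for values of the keyword-lowercased sort field.
slugs = ["abcde", "ABCD", "abce", "abc", "xyz", "ab"]


def find_as_you_type(prefix, values):
    """Return the values that start with prefix, sorted ascending.

    Illustration only: this mimics a prefix match followed by
    sort=<keyword-lowercased field> asc, it does not call Solr.
    """
    needle = prefix.lower()
    # Roughly what KeywordTokenizer + LowerCaseFilter produce: one lowercased token.
    normalized = [v.lower() for v in values]
    # Roughly what the edge ngram match selects: values beginning with the query.
    matches = [v for v in normalized if v.startswith(needle)]
    # Ascending lexicographic sort: a string sorts before any longer string it prefixes.
    return sorted(matches)


print(find_as_you_type("abc", slugs))
# ['abc', 'abcd', 'abcde', 'abce']
```

Note that the exact match comes first and ties are alphabetical, but the order is lexicographic rather than strictly by length: "abcde" sorts ahead of the shorter "abce". If you need strict shortest-first, that is where a length field would still matter.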


Best,
Erick

> On Mar 25, 2020, at 11:07 AM, matthew sporleder <msporle...@gmail.com> wrote:
> 
> My original goal was to avoid indexing the string length because I
> wanted edge ngram to "score" based on how "exact" the match was:
> 
> q=abc
> "abc" has a high score
> "abcd" has a lower score
> "abcde" has an even lower score
> 
> You say sorting by the original field will do that but in practice
> it is not happening so I am probably missing something.
> 
> I *am* getting a close version of what I said above with sorting on
> the length, which I added to the index.
> 
> searching for my keyword-lowercase field:abc* + sorting by length is
> also working so maybe I can skip the edge ngram field entirely and
> just do that but I was hoping to trade some disk space for
> performance.  This field will get queried a lot.
> 
> 
> On Wed, Mar 25, 2020 at 10:39 AM Erick Erickson <erickerick...@gmail.com> 
> wrote:
>> 
>> Why do you want to deal with score at all? Sorting
>> overrides score-based sorting. Well, unless you
>> specify score as a secondary sort. But since you’re
>> sorting by length anyway, trying to score
>> based on proximity to the end does nothing.
>> 
>> The weirdness you’re going to get here, though, is
>> that the order of the results will not be alphabetical.
>> Say you have two docs, one with abcd and one with
>> abce. Now say you search on abc. Whether abcd or
>> abce comes first is indeterminate.
>> 
>> If you simply stored the keyword-lowercased value
>> in a copyfield and sorted on _that_, you wouldn’t have
>> this problem. But if you’re really worried about space,
>> that might not be an option.
>> 
>> Best,
>> Erick
>> 
>>> On Mar 25, 2020, at 9:49 AM, matthew sporleder <msporle...@gmail.com> wrote:
>>> 
>>> Where I landed:
>>> 
>>> <fieldType name="string_ci" class="solr.TextField"
>>> sortMissingLast="true" omitNorms="false">
>>>    <analyzer>
>>>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>         <filter class="solr.LowerCaseFilterFactory" />
>>>    </analyzer>
>>> </fieldType>
>>> 
>>> <fieldType name="edgytext" class="solr.TextField" 
>>> positionIncrementGap="100">
>>> <analyzer type="index">
>>>  <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>  <filter class="solr.LowerCaseFilterFactory" />
>>>  <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
>>> maxGramSize="25" />
>>> </analyzer>
>>> <analyzer type="query">
>>>  <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>  <filter class="solr.LowerCaseFilterFactory"/>
>>> </analyzer>
>>> </fieldType>
>>> 
>>> 
>>> <field name="slug" type="string_ci" indexed="true" stored="true"
>>> multiValued="false" />
>>> <field name="fayt" type="edgytext" indexed="true" stored="false"
>>> omitNorms="false" omitTermFreqAndPositions="false" multiValued="true"
>>> />
>>> <field name="qt_len" type="int" indexed="true" stored="true"
>>> multiValued="false" />
>>> 
>>> ---
>>> 
>>> I can then do a search for
>>> 
>>> q=fayt:my_article_slu&sort=qt_len asc
>>> 
>>> to get the shortest/most exact find-as-you-type match.  I couldn't get
>>> around all results having the same score (can I boost proximity to the
>>> end of a string?) in the edge ngram search but I am hoping this is the
>>> fastest way to do this type of search since I can avoid wildcards
>>> "my_article_slu*" and stuff.
>>> 
>>> More suggestions welcome and thanks for the help.  I will re-index
>>> with omitNorms=true again to see if I can save a little space.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Tue, Mar 24, 2020 at 11:39 AM matthew sporleder <msporle...@gmail.com> 
>>> wrote:
>>>> 
>>>> Okay I appreciate you responding.
>>>> 
>>>> Switching "slug" from "string_ci" to class="solr.StrField" accomplished
>>>> about the same results, which makes sense to me now :)
>>>> 
>>>> The previous definition of string_ci was:
>>>> <fieldType name="string_ci" class="solr.TextField"
>>>> sortMissingLast="true" omitNorms="true">
>>>>    <analyzer>
>>>>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>>         <filter class="solr.LowerCaseFilterFactory" />
>>>>    </analyzer>
>>>> </fieldType>
>>>> 
>>>> So lowercase + KeywordTokenizerFactory;
>>>> 
>>>> I am trying again with omitNorms=false  to see if I can get the more
>>>> "exact" matches to score better this time around.
>>>> 
>>>> 
>>>> On Tue, Mar 24, 2020 at 9:54 AM Erick Erickson <erickerick...@gmail.com> 
>>>> wrote:
>>>>> 
>>>>> Won’t work. String types are totally unanalyzed. Your string_ci fieldType 
>>>>> is what I was looking for.
>>>>> 
>>>>> No, you shouldn’t kill the lowercasefilter unless you want all of your 
>>>>> searches to be case-sensitive.
>>>>> 
>>>>> So you should try:
>>>>> 
>>>>> q=edgy_text:whatever&sort=string_ci asc
>>>>> 
>>>>> Please use the admin>>pick_core>>analysis page when thinking about 
>>>>> changing your schema, it’ll answer a _lot_ of these questions immediately.
>>>>> 
>>>>> Best,
>>>>> Erick
>>>>> 
>>>>>> On Mar 24, 2020, at 8:37 AM, matthew sporleder <msporle...@gmail.com> 
>>>>>> wrote:
>>>>>> 
>>>>>> Oh maybe a schema bug!
>>>>>> 
>>>>>> my string_ci:
>>>>>> <fieldType name="string_ci" class="solr.TextField"
>>>>>> sortMissingLast="true" omitNorms="true">
>>>>>>   <analyzer>
>>>>>>        <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>>>>        <filter class="solr.LowerCaseFilterFactory" />
>>>>>>   </analyzer>
>>>>>> </fieldType>
>>>>>> 
>>>>>> going to try this instead:
>>>>>> <fieldType name="string_lctoken" class="solr.StrField"
>>>>>> sortMissingLast="true" omitNorms="true">
>>>>>>   <analyzer>
>>>>>>        <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>>>>        <filter class="solr.LowerCaseFilterFactory" />
>>>>>>   </analyzer>
>>>>>> </fieldType>
>>>>>> 
>>>>>> Then I can probably kill the lowercasefilter on edgytext:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Mar 24, 2020 at 7:44 AM Erick Erickson <erickerick...@gmail.com> 
>>>>>> wrote:
>>>>>>> 
>>>>>>> Sort by the full field. You’ll need to copy to a field with 
>>>>>>> keywordTokenizer and lowercaseFilter (string_ci? assuming it’s not 
>>>>>>> really a “string” type).
>>>>>>> 
>>>>>>> Best,
>>>>>>> Erick
>>>>>>> 
>>>>>>>> On Mar 24, 2020, at 7:10 AM, matthew sporleder <msporle...@gmail.com> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> I have added an edge ngram field to my index and get decent results
>>>>>>>> with partial words but the results appear randomly sorted and all
>>>>>>>> contain the same score.  Ideally I would like to sort by shortest
>>>>>>>> ngram match within my other qualifiers.
>>>>>>>> 
>>>>>>>> Is there a canonical solution to this?
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Matt
>>>>>>>> 
>>>>>>>> p.s. I mostly followed
>>>>>>>> https://lucidworks.com/post/auto-suggest-from-popular-queries-using-edgengrams/
>>>>>>>> 
>>>>>>>> schema bits:
>>>>>>>> 
>>>>>>>> <fieldType name="edgytext" class="solr.TextField" 
>>>>>>>> positionIncrementGap="100">
>>>>>>>> <analyzer type="index">
>>>>>>>> <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>>>>>> <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
>>>>>>>> maxGramSize="25" />
>>>>>>>> </analyzer>
>>>>>>>> 
>>>>>>>> <field name="slug" type="string_ci" indexed="true" stored="true"
>>>>>>>> multiValued="false" />
>>>>>>>> 
>>>>>>>> <field name="fayt" type="edgytext" indexed="true" stored="false"
>>>>>>>> omitNorms="false" omitTermFreqAndPositions="true" multiValued="true"
>>>>>>>> />
>>>>>>>> 
>>>>>>>> 
>>>>>>>> <copyField source="slug" dest="fayt" maxChars="65" />
>>>>>>> 
>>>>> 
>>
