Re: question regarding wildcard-searches

Erick Erickson Mon, 19 Mar 2018 11:36:12 -0700

right, I put in the bit about multiValued just in case. We can ignore it.

You're about to dive into the arcana of analysis chains. Be prepared
to spend some time there ;)


Here's the issue:

StandardTokenizer breaks up the input stream according to a set of
rules; a summary is here:
https://lucene.apache.org/solr/guide/6_6/tokenizers.html.

You have to work backwards from what a "google like" search means.
You've defined one use-case, roughly:
"We want to support prefix searches on these kinds of values: 'EO.1954.53.2' "

You simply can't do that with StandardTokenizer, as you've seen. You
have _two separate_ tokens in the index,
"EO" and "1954.53.2". There's no syntax that allows you to search for,
say, "EO.19*", because a prefix query like that presupposes a single
token that starts with "EO.19". (I'm cheating a little here, technically
you _could_ do something like this with ComplexPhraseQueryParser,
but I think you need a higher-level view first)....

So what to do? Well, there are several possibilities:

1> create a separate field (string type) and use the edismax query
parser to automatically search against this
new field. You'd copy from the field of interest to this new field and
analyze it differently to support this use-case.

2> Add WordDelimiter(Graph)FilterFactory to your analysis chain, _and_
change the tokenizer to something else,
perhaps WhitespaceTokenizerFactory.  NOTE: there are a bunch of
options in WD(G)FF
to do things like break on punctuation, squash everything together,
etc. So, depending on the options your index
could contain any or all of the following _single_ tokens:
EO
1954
53
2
1954532
EO.1954.53.2

now lots of things are possible. BUT note that at _query_ time (the
above was at index time) the tokens
generated to _search_ on may be different if you specify different
options. In fact, they really should be. You
may not want WD(G)FF in the _query_ time chain at all.
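
As a rough sketch of option 2> (untested, so treat it as a starting
point only), an index-time chain that could produce tokens like the
ones listed above might look like this. The fieldType name
"text_inventory" is made up for illustration, and the option values are
just one plausible combination. Check the result in the admin UI
analysis page before relying on it.

```xml
<!-- Hypothetical fieldType name; not part of the original schema. -->
<fieldType name="text_inventory" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Split on whitespace only, so 'EO.1954.53.2' reaches the filter intact. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Assumed options: emit the parts, the joined numbers, and the original. -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateNumbers="1"
            preserveOriginal="1"/>
    <!-- Graph filters need flattening at index time. -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- No word delimiter filter here, so 'EO.1954.5*' stays a single term. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The idea is that preserveOriginal="1" keeps the whole value as one token
in the index, which is what a prefix search needs, while the query-time
chain leaves the typed text alone apart from lowercasing.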

And this is just the beginning. Here's a _partial_ list of the various
tokenizers and filters (and charfilters)
that exist:
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

<1> above is very common as a solution. Not necessarily exactly "copy
to a stringField", but the concept
of copying some fields to another field that's analyzed differently in
order to support a particular use-case.
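
A minimal sketch of that copy-to-another-field idea, assuming the
source field from your schema (genormaliseerdInventarisnummer) and two
names I'm inventing here: the fieldType "string_lc" and the field
"inventarisnummer_prefix". It uses KeywordTokenizerFactory plus
LowerCaseFilterFactory rather than a plain string type so that casing
doesn't matter. Untested, adjust to taste.

```xml
<!-- Hypothetical type: the whole value becomes one lowercased token. -->
<fieldType name="string_lc" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field name; search only, so it is not stored. -->
<field name="inventarisnummer_prefix" type="string_lc" indexed="true" stored="false"/>

<!-- Copy the raw inventory number into the prefix-searchable field. -->
<copyField source="genormaliseerdInventarisnummer" dest="inventarisnummer_prefix"/>
```

You need to reindex after a change like this, since copyField only
takes effect when documents are indexed.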

edismax lets you take the text entered in the search box and
automagically search across a number of
fields, explicitly to support the "Google-like" search, where
"Google-like" is defined as "type a bunch of
stuff in a single entry field and just search, man". The "just search"
bit is where all the ugly details come in:
1> figuring out the cases you need to support
and
2> figuring out the tricks you have to play under the covers to make it a
simple search for the user

And you're finding out all about that now ;)

BTW, edismax is configured in solrconfig.xml and is a specific query parser, see:
https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html
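
For illustration only, request handler defaults along these lines would
make edismax the default parser and search two fields at once. defType
and qf are standard parameters; "inventarisnummer_prefix" is a
hypothetical field name standing in for whatever differently-analyzed
copy you end up creating, and your existing handler will have other
settings that I'm leaving out.

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Use the extended dismax parser for the plain search box. -->
    <str name="defType">edismax</str>
    <!-- Fields to search; the second name is an assumption, not from your schema. -->
    <str name="qf">_text_ inventarisnummer_prefix</str>
  </lst>
</requestHandler>
```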

Best,
Erick

On Mon, Mar 19, 2018 at 9:06 AM, Paesen Roel
<roel.pae...@africamuseum.be> wrote:
> Hi,
>
> The goal is to provide a google-like search field for our databases, (one 
> simple searchfield on a webpage) that is why we copy everything into the 
> _text_ field, so that everything is searchable. (is there a better way to 
> achieve something like this?)
>
> I should have been more clear before, but the different numbers I gave as 
> example are all different solr-documents, with only 1 number per 
> solr-document, so there is no need (for this field) to be multi-valued. Sorry 
> about that.
>
> Here is my text_general definition (which is a direct copy from the 
> DIH-example that comes with solr 7.2.1):
> -----------8<------------
>     <fieldType name="text_general" class="solr.TextField" 
> positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true" 
> words="stopwords.txt" />
>         <!-- in this example, we will only use synonyms at query time
>         <filter class="solr.SynonymGraphFilterFactory" 
> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>         <filter class="solr.FlattenGraphFilterFactory"/>
>         -->
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true" 
> words="stopwords.txt" />
>         <filter class="solr.SynonymGraphFilterFactory" 
> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldType>
> -----------8<------------
>
> In the analysis screen, I see that indeed the text gets broken down to 'EO' 
> (alphanumeric), and '1954.53.1' (numeric).
> Searching without wildcard also returns zero results...
>
> As I mentioned before: we are testing this all, so we are not really up to 
> speed with the why-does-this-do-that, although I am trying to learn.
>
> Thanks for any other pointers you can provide.
> Greetings,
> Roel
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: vrijdag 16 maart 2018 17:08
> To: solr-user
> Subject: Re: question regarding wildcard-searches
>
> If your goal is to search prefixes only, I'd move away from the _text_ field 
> altogether and use a "string" type. This will mean you need to
> 1> make it multiValued=true
> 2> split this up (either on your client or use a
> FieldMutatingUpdateProcessor, probably RegexReplaceProcessorFactory) into 
> separate entries, i.e.
> 'EO.1954.53.1', 'EO.1954.53.2', 'EO.1954.53.3'
> becomes three separate entries in the field 'EO.1954.53.1'
> 'EO.1954.53.2'
> 'EO.1954.53.3'
>
> At that point, searches like: 'EO.1954.53.*'
>
> will work just fine. NOTE: String types do zero analysis, so you have to 
> handle things like casing yourself. That is, 'eO.1954.53.*' would _not_ 
> match. You can probably use something like KeywordTokenizerFactory + 
> LowerCaseFilterFactory in that case.
>
> All this makes _much_ more sense if you use the admin UI>>analysis page 
> (probably uncheck the "verbose" checkbox, there'll be less clutter).
>
> Best,
> Erick
>
> On Fri, Mar 16, 2018 at 8:35 AM, Emir Arnautović 
> <emir.arnauto...@sematext.com> wrote:
>> Hi Roel,
>> As mentioned, the _text_ field probably does not contain the complete 
>> “EO.1954.53.1” but only its parts. You can verify that using the analysis 
>> screen in the admin console. What you can try is searching for the phrase 
>> without a wildcard, “EO.1954.53”, or, if you are using 
>> WordDelimiterTokenFilter in your analysis chain, you can set 
>> preserveOriginal=“1” and reindex.
>>
>> Can you share what your text_general looks like?
>>
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection Solr &
>> Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>
>>
>>> On 16 Mar 2018, at 14:05, Paesen Roel <roel.pae...@africamuseum.be> wrote:
>>>
>>> Hi,
>>>
>>> Unfortunately that also gives no results (and it would not be
>>> practical, as for this example the numbering only goes up till 19 but
>>> others go up into the thousands etc)
>>>
>>> Anybody with a pointer on this?
>>>
>>> Thanks already,
>>> Roel
>>>
>>>
>>> -----Original Message-----
>>> From: jagdish vasani [mailto:jagdisht.vas...@gmail.com]
>>> Sent: vrijdag 16 maart 2018 12:41
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: question regarding wildcard-searches
>>>
>>> Hi paesen,
>>>
>>> Value - EO.1954.53.1 is indexed as below:
>>> Eo
>>> 1954
>>> 53
>>> 1
>>> The dot is removed. Try with the wildcard '?',
>>> like EO.1954.53.?? if you have only 2 digits in the last part.
>>>
>>> I have not tried but you just check it.
>>> Hope it will solve your problem.
>>>
>>> Thanks,
>>> Jagdish
>>> On 16-Mar-2018 3:51 pm, "Paesen Roel" <roel.pae...@africamuseum.be> wrote:
>>>
>>>> Hi everybody,
>>>>
>>>> We are experimenting with solr, and I have a (I think) basic-level
>>>> question:
>>>> we have multiple fields, all copied into a generic field so we can
>>>> search everything at once.
>>>> However we have a (for us) strange situation doing wildcard searches
>>>> for the contents of one specific field.
>>>>
>>>> Given in the schema:
>>>>
>>>> <field name="_text_" type="text_general" indexed="true" stored="false"
>>>> multiValued="true"/>
>>>>
>>>> <field name="genormaliseerdInventarisnummer" type="string" indexed="true"
>>>> stored="true"/>
>>>> <copyField source="genormaliseerdInventarisnummer" dest="_text_" />
>>>> and a lot of other fields exactly like 'genormaliseerdInventarisnummer'.
>>>>
>>>>
>>>> Now, we are certain that the field 'genormaliseerdInventarisnummer'
>>>> contains entries like 'EO.1954.53.1', 'EO.1954.53.2', 'EO.1954.53.3',
>>>> all the way up to '.19', we can query these directly by passing
>>>> these exact texts to the query on field '_text_' (our default search 
>>>> field).
>>>> Problem is: wildcard searches for these don't work, like 'EO.1954.53.*'
>>>> for example returns zero results.
>>>>
>>>> Why is that?
>>>> What needs to be adjusted? (and how?)
>>>>
>>>> Thanks already,
>>>> Roel
>>>>
>>>>
>>
