bq: my dog
has fleas
I wouldn't want some variant of "og ha" to match

Here's where the mysterious "positionIncrementGap" comes in. If you
make this field "multiValued" and index it like this:
<doc>
    <field name="blah">my dog</field>
    <field name="blah">has fleas</field>
</doc>

or equivalently in SolrJ just

SolrInputDocument doc = new SolrInputDocument();
doc.addField("blah", "my dog");
doc.addField("blah", "has fleas");

then the position of "dog" will be 2 and the position of "has" will be
102, assuming positionIncrementGap is the default 100. N.B. I'm not
sure whether you'll see this on the admin/analysis page.

Anyway, now your example won't match across the two parts unless
you specify a "slop" up in the 101 range.
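
For illustration, a minimal sketch of what that could look like in
schema.xml (the type name and analyzer here are just placeholders):

<fieldType name="text_gap" class="solr.TextField"
    positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
<field name="blah" type="text_gap" indexed="true" stored="true"
    multiValued="true"/>

With that in place, a phrase query like

blah:"dog has"~101

should match across the two values, while blah:"dog has" with no slop
should not.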

Best,
Erick

On Wed, Aug 26, 2015 at 2:19 AM, Christian Ramseyer <r...@networkz.ch> wrote:
> On 26/08/15 00:24, Erick Erickson wrote:
>> Hmmm, this sounds like a nonsensical question, but "what do you mean
>> by arbitrary substring"?
>>
>> Because if your substrings consist of whole _tokens_, then ngramming
>> is totally unnecessary (and gets in the way). Phrase queries with no slop
>> fulfill this requirement.
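>>
>> For example, a phrase query like
>>
>>     content:"my dog has fleas"
>>
>> (with no slop) matches only when those whole tokens appear adjacent
>> and in that order.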
>>
>> But let's assume you need to match within tokens, i.e. if the doc
>> contains "my dog has fleas", you need to match input like "as fle", in this
>> case ngramming is an option.
>
> Yeah the "as fle"-thing is exactly what I want to achieve.
>
>>
>> You have substantially different index and query time chains. The
>> result is that the offsets for all the grams at index time are the
>> same in the quick experiment I tried; all were 1. But at query time,
>> each gram had an incremented position.
>>
>> I'd start by using the query time analysis chain for indexing also.
>> Next, I'd try enclosing multiple words in double quotes at query time
>> and go from there. What you have now is an anti-pattern in that having
>> substantially different index and query time analysis chains is not
>> something that's likely to be very predictable unless you know
>> _exactly_ what the consequences are.
>>
>> The admin/analysis page is your friend; in this case, check the
>> "verbose" checkbox to see what I mean.
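>>
>> For example (untested, just a sketch), using your query-time chain on
>> both sides would look like:
>>
>> <fieldType name="configtext_trigram" class="solr.TextField">
>>     <analyzer>
>>         <tokenizer class="solr.NGramTokenizerFactory"
>>             minGramSize="3" maxGramSize="3"/>
>>         <filter class="solr.LowerCaseFilterFactory"/>
>>     </analyzer>
>> </fieldType>
>>
>> A single <analyzer> with no type attribute is applied at both index
>> and query time, so the gram positions line up on both sides.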
>
> Hmm interesting. I had the additional \R tokenizer in the index chain
> because the document can be multiple lines (but the search text is
> always a single line), and if the document was
>
> my dog
> has fleas
>
> I wouldn't want some variant of "og ha" to match, but I didn't realize
> it didn't give me any positions, as you noticed.
>
> I'll try to experiment some more, thanks for the hints!
>
> Chris
>
>>
>> Best,
>> Erick
>>
>> On Tue, Aug 25, 2015 at 3:00 PM, Christian Ramseyer <r...@networkz.ch> wrote:
>>> Hi
>>>
>>> I'm trying to build an index for technical documents that basically
>>> works like "grep", i.e. the user gives an arbitrary substring somewhere
>>> in a line of a document and the exact matches will be returned. I
>>> specifically want no stemming etc. and keep all whitespace, parentheses
>>> etc. because they might be significant. The only normalization is that
>>> the search should be case-insensitive.
>>>
>>> I tried to achieve this by tokenizing on line breaks, and then building
>>> trigrams of the individual lines:
>>>
>>> <fieldType name="configtext_trigram" class="solr.TextField" >
>>>
>>>     <analyzer type="index">
>>>
>>>         <tokenizer class="solr.PatternTokenizerFactory"
>>>             pattern="\R" group="-1"/>
>>>
>>>         <filter class="solr.NGramFilterFactory"
>>>             minGramSize="3" maxGramSize="3"/>
>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>
>>>     </analyzer>
>>>
>>>     <analyzer type="query">
>>>
>>>         <tokenizer class="solr.NGramTokenizerFactory"
>>>             minGramSize="3" maxGramSize="3"/>
>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>
>>>     </analyzer>
>>> </fieldType>
>>>
>>> Then in the search, I use the edismax parser with mm=100%, so given the
>>> documents
>>>
>>>
>>> {"id":"test1","content":"
>>> encryption
>>> 10.0.100.22
>>> description
>>> "}
>>>
>>> {"id":"test2","content":"
>>> 10.100.0.22
>>> description
>>> "}
>>>
>>> and the query content:encryption, this will turn into
>>>
>>> "parsedquery_toString":
>>>
>>> "+((content:enc content:ncr content:cry content:ryp
>>> content:ypt content:pti content:tio content:ion)~8)",
>>>
>>> and return only the first document. All fine and dandy. But I have a
>>> problem with possible false positives. If the search is e.g.
>>>
>>> content:.100.22
>>>
>>> then the generated query will be
>>>
>>> "parsedquery_toString":
>>> "+((content:.10 content:100 content:00. content:0.2 content:.22)~5)",
>>>
>>> and because all of these tokens are also generated for document test2
>>> within a proximity of 5, both documents will wrongly be returned.
>>>
>>> So somehow I'd need to express the query "content:.10 content:100
>>> content:00. content:0.2 content:.22" with *the tokens exactly in this
>>> order and nothing in between*. Is this somehow possible, maybe by using
>>> the termvectors/termpositions stuff? Or am I trying to do something
>>> that's fundamentally impossible? Any other good ideas on how to
>>> achieve this kind of behaviour?
>>>
>>> Thanks
>>> Christian
>>>
>>>
>>>
>
