Re: Exact substring search with ngrams

Christian Ramseyer Wed, 26 Aug 2015 02:19:38 -0700

On 26/08/15 00:24, Erick Erickson wrote:
> Hmmm, this sounds like a nonsensical question, but "what do you mean
> by arbitrary substring"?
> 
> Because if your substrings consist of whole _tokens_, then ngramming
> is totally unnecessary (and gets in the way). Phrase queries with no slop
> fulfill this requirement.
> 
> But let's assume you need to match within tokens, i.e. if the doc
> contains "my dog has fleas", you need to match input like "as fle". In this
> case ngramming is an option.


Yeah the "as fle"-thing is exactly what I want to achieve.

> 
> You have substantially different index and query time chains. The result is
> that the offsets for all the grams at index time are the same; in the quick
> experiment I tried, all were 1. But at query time, each gram had an
> incremented position.
> 
> I'd start by using the query time analysis chain for indexing also. Next, I'd
> try enclosing multiple words in double quotes at query time and go from there.
> What you have now is an anti-pattern in that having substantially different
> index and query time analysis chains is not something that's likely to be
> very predictable unless you know _exactly_ what the consequences are.
> 
> The admin/analysis page is your friend, in this case check the "verbose"
> checkbox to see what I mean.

Hmm interesting. I had the additional \R tokenizer in the index chain
because the document can be multiple lines (but the search text is
always a single line) and if the document was

my dog
has fleas

I wouldn't want some variant of "og ha" to match, but I didn't realize
it didn't give me any positions like you noticed.
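
The interplay between line boundaries and gram positions can be modelled outside Solr. The following is a toy Python sketch, not Solr's implementation: `build_positions`, `phrase_match` and `LINE_GAP` are hypothetical names introduced only for illustration. It assumes each trigram is recorded with an increasing position within its line, and that lines are kept far apart in position space so that a sequence of grams cannot span a line break.

```python
from collections import defaultdict

LINE_GAP = 100  # hypothetical: extra distance between lines in position space


def build_positions(document):
    """Map each lowercased trigram to the set of positions where it starts."""
    positions = defaultdict(set)
    offset = 0
    for line in document.splitlines():
        line = line.lower()
        for i in range(len(line) - 2):
            positions[line[i:i + 3]].add(offset + i)
        # Advance past this line, plus a gap, before indexing the next one.
        offset += len(line) + LINE_GAP
    return positions


def phrase_match(positions, query):
    """True if the query's trigrams occur at consecutive positions."""
    query = query.lower()
    grams = [query[i:i + 3] for i in range(len(query) - 2)]
    if not grams:
        return False  # queries shorter than three characters yield no grams
    return any(
        all(start + k in positions.get(gram, ()) for k, gram in enumerate(grams))
        for start in positions.get(grams[0], ())
    )
```

With the two-line document above, `phrase_match(build_positions("my dog\nhas fleas"), "as fle")` is true, while the same call with "og ha" is false because no trigram containing both "g" and "h" is ever produced. If every gram shared one position, as observed for the index chain, this kind of consecutive-position check would have nothing to work with.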

I'll try to experiment some more, thanks for the hints!

Chris

> 
> Best,
> Erick
> 
> On Tue, Aug 25, 2015 at 3:00 PM, Christian Ramseyer <r...@networkz.ch> wrote:
>> Hi
>>
>> I'm trying to build an index for technical documents that basically
>> works like "grep", i.e. the user gives an arbitrary substring somewhere
>> in a line of a document and the exact matches will be returned. I
>> specifically want no stemming etc. and keep all whitespace, parentheses
>> etc. because they might be significant. The only normalization is that
>> the search should be case-insensitive.
>>
>> I tried to achieve this by tokenizing on line breaks, and then building
>> trigrams of the individual lines:
>>
>> <fieldType name="configtext_trigram" class="solr.TextField" >
>>
>>     <analyzer type="index">
>>
>>         <tokenizer class="solr.PatternTokenizerFactory"
>>             pattern="\R" group="-1"/>
>>
>>         <filter class="solr.NGramFilterFactory"
>>             minGramSize="3" maxGramSize="3"/>
>>         <filter class="solr.LowerCaseFilterFactory"/>
>>
>>     </analyzer>
>>
>>     <analyzer type="query">
>>
>>         <tokenizer class="solr.NGramTokenizerFactory"
>>             minGramSize="3" maxGramSize="3"/>
>>         <filter class="solr.LowerCaseFilterFactory"/>
>>
>>     </analyzer>
>> </fieldType>
>>
>> Then in the search, I use the edismax parser with mm=100%, so given the
>> documents
>>
>>
>> {"id":"test1","content":"
>> encryption
>> 10.0.100.22
>> description
>> "}
>>
>> {"id":"test2","content":"
>> 10.100.0.22
>> description
>> "}
>>
>> and the query content:encryption, this will turn into
>>
>> "parsedquery_toString":
>>
>> "+((content:enc content:ncr content:cry content:ryp
>> content:ypt content:pti content:tio content:ion)~8)",
>>
>> and return only the first document. All fine and dandy. But I have a
>> problem with possible false positives. If the search is e.g.
>>
>> content:.100.22
>>
>> then the generated query will be
>>
>> "parsedquery_toString":
>> "+((content:.10 content:100 content:00. content:0.2 content:.22)~5)",
>>
>> and because all of these tokens are also generated for document test2 in
>> the proximity of 5, both documents will wrongly be returned.
>>
>> So somehow I'd need to express the query "content:.10 content:100
>> content:00. content:0.2 content:.22" with *the tokens exactly in this
>> order and nothing in between*. Is this somehow possible, maybe by using
>> the termvectors/termpositions stuff? Or am I trying to do something
>> that's fundamentally impossible? Other good ideas how to achieve this
>> kind of behaviour?
>>
>> Thanks
>> Christian
>>
>>
>>
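
The false positive described in the original question can be reproduced without Solr. This is a minimal Python sketch under stated assumptions: `trigrams`, `all_grams_present` and `grams_in_sequence` are hypothetical helpers, the first check only approximates what mm=100% does (every gram must be present, order ignored), and the second approximates the requested "exactly in this order and nothing in between" behaviour.

```python
def trigrams(text):
    """Return the lowercased character trigrams of text, in order."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def all_grams_present(line, query):
    """Order-insensitive check: every query gram occurs somewhere in line."""
    wanted = trigrams(query)
    available = set(trigrams(line))
    return bool(wanted) and all(gram in available for gram in wanted)


def grams_in_sequence(line, query):
    """Order-sensitive check: query grams occur at consecutive positions."""
    doc, wanted = trigrams(line), trigrams(query)
    if not wanted:
        return False
    return any(
        doc[i:i + len(wanted)] == wanted
        for i in range(len(doc) - len(wanted) + 1)
    )
```

For the query ".100.22", `all_grams_present` accepts both "10.0.100.22" and "10.100.0.22", which is the unwanted match on test2. `grams_in_sequence` accepts only "10.0.100.22", because in "10.100.0.22" the grams "0.0" and ".0." sit between "00." and "0.2".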
