On 26/08/15 00:24, Erick Erickson wrote:
> Hmmm, this sounds like a nonsensical question, but "what do you mean
> by arbitrary substring"?
>
> Because if your substrings consist of whole _tokens_, then ngramming
> is totally unnecessary (and gets in the way). Phrase queries with no slop
> fulfill this requirement.
>
> But let's assume you need to match within tokens, i.e. if the doc
> contains "my dog has fleas", you need to match input like "as fle". In this
> case ngramming is an option.
Yeah, the "as fle" case is exactly what I want to achieve.

> You have substantially different index and query time chains. The result is
> that the offsets for all the grams at index time are the same: in the quick
> experiment I tried, all were 1. But at query time, each gram had an
> incremented position.
>
> I'd start by using the query time analysis chain for indexing also. Next, I'd
> try enclosing multiple words in double quotes at query time and go from there.
> What you have now is an anti-pattern, in that having substantially different
> index and query time analysis chains is not something that's likely to be
> very predictable unless you know _exactly_ what the consequences are.
>
> The admin/analysis page is your friend; in this case, check the "verbose"
> checkbox to see what I mean.

Hmm, interesting. I had the additional \R tokenizer in the index chain because
the document can span multiple lines (the search text is always a single
line): if the document were "my dog has fleas" with a line break in the
middle, I wouldn't want some variant of "og ha" to match across it. But I
didn't realize it didn't give me any positions, as you noticed. I'll try to
experiment some more, thanks for the hints!

Chris

> Best,
> Erick
>
> On Tue, Aug 25, 2015 at 3:00 PM, Christian Ramseyer <r...@networkz.ch> wrote:
>> Hi
>>
>> I'm trying to build an index for technical documents that basically
>> works like "grep", i.e. the user gives an arbitrary substring somewhere
>> in a line of a document and the exact matches will be returned. I
>> specifically want no stemming etc., and I want to keep all whitespace,
>> parentheses etc. because they might be significant. The only
>> normalization is that the search should be case-insensitive.
>>
>> I tried to achieve this by tokenizing on line breaks, and then building
>> trigrams of the individual lines:
>>
>> <fieldType name="configtext_trigram" class="solr.TextField">
>>   <analyzer type="index">
>>     <tokenizer class="solr.PatternTokenizerFactory"
>>                pattern="\R" group="-1"/>
>>     <filter class="solr.NGramFilterFactory"
>>             minGramSize="3" maxGramSize="3"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.NGramTokenizerFactory"
>>                minGramSize="3" maxGramSize="3"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>> Then in the search, I use the edismax parser with mm=100%, so given the
>> documents
>>
>> {"id":"test1","content":"
>> encryption
>> 10.0.100.22
>> description
>> "}
>>
>> {"id":"test2","content":"
>> 10.100.0.22
>> description
>> "}
>>
>> and the query content:encryption, this will turn into
>>
>> "parsedquery_toString":
>> "+((content:enc content:ncr content:cry content:ryp
>> content:ypt content:pti content:tio content:ion)~8)",
>>
>> and return only the first document. All fine and dandy. But I have a
>> problem with possible false positives. If the search is e.g.
>>
>> content:.100.22
>>
>> then the generated query will be
>>
>> "parsedquery_toString":
>> "+((content:.10 content:100 content:00. content:0.2 content:.22)~5)",
>>
>> and because all of these tokens are also generated for document test2
>> within the proximity of 5, both documents will wrongly be returned.
>>
>> So somehow I'd need to express the query "content:.10 content:100
>> content:00. content:0.2 content:.22" with *the tokens exactly in this
>> order and nothing in between*. Is this somehow possible, maybe by using
>> the termvectors/termpositions stuff? Or am I trying to do something
>> that's fundamentally impossible? Are there other good ideas how to
>> achieve this kind of behaviour?
>>
>> Thanks
>> Christian
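[Editor's note] The false positive discussed in this thread can be reproduced outside Solr. The sketch below is plain Python, not Solr code, and the function names are illustrative: `loose_match` models what mm=100% effectively requires (every query trigram occurs somewhere in the field), while `ordered_match` models the grep-like semantics Christian is asking for (the query trigrams appear in order at consecutive positions, i.e. a substring match over the gram stream).

```python
def trigrams(line):
    """All overlapping character 3-grams of a single line, lowercased."""
    line = line.lower()
    return [line[i:i + 3] for i in range(len(line) - 2)]

def loose_match(doc_line, query):
    """mm=100%-style matching: every query gram must occur somewhere."""
    grams = set(trigrams(doc_line))
    return all(g in grams for g in trigrams(query))

def ordered_match(doc_line, query):
    """Grep-like matching: query grams in order at consecutive positions."""
    doc, q = trigrams(doc_line), trigrams(query)
    return any(doc[i:i + len(q)] == q for i in range(len(doc) - len(q) + 1))

# test1 contains the line "10.0.100.22", test2 the line "10.100.0.22"
print(loose_match("10.100.0.22", ".100.22"))    # True: the false positive
print(ordered_match("10.100.0.22", ".100.22"))  # False: correctly rejected
print(ordered_match("10.0.100.22", ".100.22"))  # True: the real hit
```

This is only a model of the desired semantics, not a Solr solution; in Lucene/Solr terms, the ordered variant corresponds to requiring the grams as an exact phrase (positions in order with no gaps), which is why the index-time chain needs to emit incremented positions in the first place.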