bq: my dog
has fleas
I wouldn't want some variant of "og ha" to match
Here's where the mysterious "positionIncrementGap" comes in. If you
make this field "multiValued" and index it like this:

<doc>
  <field name="blah">my dog</field>
  <field name="blah">has fleas</field>
</doc>

or equivalently in SolrJ just

doc.addField("blah", "my dog");
doc.addField("blah", "has fleas");

then the position of "dog" will be 2 and the position of "has" will be
102, assuming the positionIncrementGap is the default 100. N.B. I'm not
sure whether you'll see this in the admin/analysis page or not.

Anyway, now your example won't match across the two parts unless you
specify a "slop" up in the 101 range.
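As a concrete sketch (untested; the field and type names here are made
up for illustration), the schema side of this might look like:

<fieldType name="blah_text" class="solr.TextField"
    positionIncrementGap="100">
  <!-- hypothetical names; 100 positions are inserted between values -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="blah" type="blah_text" indexed="true" stored="true"
    multiValued="true"/>

With that in place, blah:"dog has" should not match the document above,
while blah:"dog has"~101 should, because the slop lets the phrase span
the 100-position gap between the two values.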
Best,
Erick

On Wed, Aug 26, 2015 at 2:19 AM, Christian Ramseyer <r...@networkz.ch> wrote:
> On 26/08/15 00:24, Erick Erickson wrote:
>> Hmmm, this sounds like a nonsensical question, but "what do you mean
>> by arbitrary substring"?
>>
>> Because if your substrings consist of whole _tokens_, then ngramming
>> is totally unnecessary (and gets in the way). Phrase queries with no
>> slop fulfill this requirement.
>>
>> But let's assume you need to match within tokens, i.e. if the doc
>> contains "my dog has fleas", you need to match input like "as fle";
>> in this case ngramming is an option.
>
> Yeah, the "as fle" thing is exactly what I want to achieve.
>
>> You have substantially different index and query time chains. The
>> result is that the offsets for all the grams at index time are the
>> same in the quick experiment I tried; all were 1. But at query time,
>> each gram had an incremented position.
>>
>> I'd start by using the query time analysis chain for indexing also.
>> Next, I'd try enclosing multiple words in double quotes at query time
>> and go from there. What you have now is an anti-pattern in that
>> having substantially different index and query time analysis chains
>> is not something that's likely to be very predictable unless you know
>> _exactly_ what the consequences are.
>>
>> The admin/analysis page is your friend; in this case check the
>> "verbose" checkbox to see what I mean.
>
> Hmm, interesting. I had the additional \R tokenizer in the index chain
> because the document can be multiple lines (but the search text is
> always a single line), and if the document was
>
> my dog
> has fleas
>
> I wouldn't want some variant of "og ha" to match, but I didn't
> realize it didn't give me any positions like you noticed.
>
> I'll try to experiment some more, thanks for the hints!
>
> Chris
>
>> Best,
>> Erick
>>
>> On Tue, Aug 25, 2015 at 3:00 PM, Christian Ramseyer <r...@networkz.ch> wrote:
>>> Hi
>>>
>>> I'm trying to build an index for technical documents that basically
>>> works like "grep", i.e. the user gives an arbitrary substring
>>> somewhere in a line of a document and the exact matches will be
>>> returned. I specifically want no stemming etc. and want to keep all
>>> whitespace, parentheses etc. because they might be significant. The
>>> only normalization is that the search should be case-insensitive.
>>>
>>> I tried to achieve this by tokenizing on line breaks and then
>>> building trigrams of the individual lines:
>>>
>>> <fieldType name="configtext_trigram" class="solr.TextField">
>>>
>>>   <analyzer type="index">
>>>     <tokenizer class="solr.PatternTokenizerFactory"
>>>         pattern="\R" group="-1"/>
>>>     <filter class="solr.NGramFilterFactory"
>>>         minGramSize="3" maxGramSize="3"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>   </analyzer>
>>>
>>>   <analyzer type="query">
>>>     <tokenizer class="solr.NGramTokenizerFactory"
>>>         minGramSize="3" maxGramSize="3"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>   </analyzer>
>>>
>>> </fieldType>
>>>
>>> Then in the search, I use the edismax parser with mm=100%, so given
>>> the documents
>>>
>>> {"id":"test1","content":"
>>> encryption
>>> 10.0.100.22
>>> description
>>> "}
>>>
>>> {"id":"test2","content":"
>>> 10.100.0.22
>>> description
>>> "}
>>>
>>> and the query content:encryption, this turns into
>>>
>>> "parsedquery_toString":
>>> "+((content:enc content:ncr content:cry content:ryp
>>> content:ypt content:pti content:tio content:ion)~8)",
>>>
>>> and returns only the first document. All fine and dandy. But I have
>>> a problem with possible false positives. If the search is e.g.
>>>
>>> content:.100.22
>>>
>>> then the generated query will be
>>>
>>> "parsedquery_toString":
>>> "+((content:.10 content:100 content:00. content:0.2 content:.22)~5)",
>>>
>>> and because all of the tokens are also generated for document test2
>>> within a proximity of 5, both documents will wrongly be returned.
>>>
>>> So somehow I'd need to express the query "content:.10 content:100
>>> content:00. content:0.2 content:.22" with *the tokens exactly in
>>> this order and nothing in between*. Is this somehow possible, maybe
>>> by using the termvectors/termpositions stuff? Or am I trying to do
>>> something that's fundamentally impossible? Any other good ideas on
>>> how to achieve this kind of behaviour?
>>>
>>> Thanks
>>> Christian
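For what it's worth, a symmetric version of the trigram chain along
the lines Erick suggests might look like this (an untested sketch; it
drops the \R pre-tokenization and reuses the type name from above):

<fieldType name="configtext_trigram" class="solr.TextField">
  <!-- sketch only: the same chain at index and query time, so grams
       get incrementing positions on both sides -->
  <analyzer>
    <tokenizer class="solr.NGramTokenizerFactory"
        minGramSize="3" maxGramSize="3"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With positions incrementing on both sides, quoting the input at query
time, e.g. content:".100.22", turns the grams into a phrase that has
to match at consecutive positions, so test2 above should no longer
come back as a false positive.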