Re: Exact substring search with ngrams
On 26/08/15 18:05, Erick Erickson wrote:
> bq: my dog
> has fleas
>
> I wouldn't want some variant of "og ha" to match
>
> Here's where the mysterious "positionIncrementGap" comes in. If you
> make this field "multiValued", and index it like this:
>
> my dog
> has fleas
>
> then the position of "dog" will be 2 and the position of "has" will be
> 102, assuming the positionIncrementGap is the default 100. N.B. I'm not
> sure whether you'll see this in the admin/analysis page or not.
>
> Anyway, now your example won't match across the two parts unless
> you specify a "slop" up in the 101 range.

Oh that's nifty, thanks!
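A sketch of the setup Erick describes (field and type names are illustrative, not from the thread): a multiValued field whose type declares positionIncrementGap="100", so each indexed value starts 100 positions after the end of the previous one and phrase or slop queries won't match across values.

```xml
<!-- Sketch only: names are made up for illustration. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- Indexing the two values "my dog" and "has fleas" separately puts
     "dog" at position 2 and "has" at position 102, per Erick's example. -->
<field name="content" type="text_general" multiValued="true"
       indexed="true" stored="true"/>
```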
Re: Exact substring search with ngrams
On 26/08/15 00:24, Erick Erickson wrote:
> Hmmm, this sounds like a nonsensical question, but "what do you mean
> by arbitrary substring"?
>
> Because if your substrings consist of whole _tokens_, then ngramming
> is totally unnecessary (and gets in the way). Phrase queries with no slop
> fulfill this requirement.
>
> But let's assume you need to match within tokens, i.e. if the doc
> contains "my dog has fleas", you need to match input like "as fle"; in
> this case ngramming is an option.

Yeah, the "as fle" thing is exactly what I want to achieve.

> You have substantially different index and query time chains. The result
> is that the offsets for all the grams at index time are the same in the
> quick experiment I tried; all were 1. But at query time, each gram had an
> incremented position.
>
> I'd start by using the query time analysis chain for indexing also. Next,
> I'd try enclosing multiple words in double quotes at query time and go
> from there. What you have now is an anti-pattern in that having
> substantially different index and query time analysis chains is not
> something that's likely to be very predictable unless you know _exactly_
> what the consequences are.
>
> The admin/analysis page is your friend; in this case check the "verbose"
> checkbox to see what I mean.

Hmm, interesting. I had the additional \R tokenizer in the index chain
because the document can be multiple lines (but the search text is always a
single line), and if the document was

my dog
has fleas

I wouldn't want some variant of "og ha" to match. But I didn't realize it
didn't give me any positions, like you noticed. I'll try to experiment some
more, thanks for the hints!

Chris

> Best,
> Erick
>
> On Tue, Aug 25, 2015 at 3:00 PM, Christian Ramseyer wrote:
>> Hi
>>
>> I'm trying to build an index for technical documents that basically
>> works like "grep", i.e. the user gives an arbitrary substring somewhere
>> in a line of a document and the exact matches will be returned. I
>> specifically want no stemming etc. and want to keep all whitespace,
>> parentheses etc. because they might be significant. The only
>> normalization is that the search should be case-insensitive.
>>
>> I tried to achieve this by tokenizing on line breaks, and then building
>> trigrams of the individual lines:
>>
>> pattern="\R" group="-1"/>
>> minGramSize="3" maxGramSize="3"/>
>>
>> minGramSize="3" maxGramSize="3"/>
>>
>> Then in the search, I use the edismax parser with mm=100%, so given the
>> documents
>>
>> {"id":"test1","content":"
>> encryption
>> 10.0.100.22
>> description
>> "}
>>
>> {"id":"test2","content":"
>> 10.100.0.22
>> description
>> "}
>>
>> and the query content:encryption, this will turn into
>>
>> "parsedquery_toString":
>> "+((content:enc content:ncr content:cry content:ryp
>> content:ypt content:pti content:tio content:ion)~8)",
>>
>> and return only the first document. All fine and dandy. But I have a
>> problem with possible false positives. If the search is e.g.
>>
>> content:.100.22
>>
>> then the generated query will be
>>
>> "parsedquery_toString":
>> "+((content:.10 content:100 content:00. content:0.2 content:.22)~5)",
>>
>> and because all of these tokens are also generated for document test2
>> within a proximity of 5, both documents will wrongly be returned.
>>
>> So somehow I'd need to express the query "content:.10 content:100
>> content:00. content:0.2 content:.22" with *the tokens exactly in this
>> order and nothing in between*. Is this somehow possible, maybe by using
>> the termvectors/termpositions stuff? Or am I trying to do something
>> that's fundamentally impossible? Any other good ideas how to achieve
>> this kind of behaviour?
>>
>> Thanks
>> Christian
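The schema XML in the quoted message was stripped by the list archive; only attribute fragments survive (pattern="\R" group="-1", minGramSize="3" maxGramSize="3"). A plausible reconstruction under those fragments — the factory class names and the lowercase filter are my assumptions, the latter based on the stated case-insensitivity requirement — might look like:

```xml
<!-- Reconstruction, not the original schema. -->
<fieldType name="grep_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- one token per line of the multi-line document -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\R" group="-1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="3"/>
  </analyzer>
  <analyzer type="query">
    <!-- the search text is always a single line, so no \R tokenizer here -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="3"/>
  </analyzer>
</fieldType>
```

This mismatch between the two chains is exactly the anti-pattern Erick points out: the index-side grams all share one position per line, while the query-side grams get incremented positions.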
Exact substring search with ngrams
Hi

I'm trying to build an index for technical documents that basically works
like "grep", i.e. the user gives an arbitrary substring somewhere in a line
of a document and the exact matches will be returned. I specifically want no
stemming etc. and want to keep all whitespace, parentheses etc. because they
might be significant. The only normalization is that the search should be
case-insensitive.

I tried to achieve this by tokenizing on line breaks, and then building
trigrams of the individual lines.

Then in the search, I use the edismax parser with mm=100%, so given the
documents

{"id":"test1","content":"
encryption
10.0.100.22
description
"}

{"id":"test2","content":"
10.100.0.22
description
"}

and the query content:encryption, this will turn into

"parsedquery_toString":
"+((content:enc content:ncr content:cry content:ryp
content:ypt content:pti content:tio content:ion)~8)",

and return only the first document. All fine and dandy. But I have a problem
with possible false positives. If the search is e.g.

content:.100.22

then the generated query will be

"parsedquery_toString":
"+((content:.10 content:100 content:00. content:0.2 content:.22)~5)",

and because all of these tokens are also generated for document test2 within
a proximity of 5, both documents will wrongly be returned.

So somehow I'd need to express the query "content:.10 content:100
content:00. content:0.2 content:.22" with *the tokens exactly in this order
and nothing in between*. Is this somehow possible, maybe by using the
termvectors/termpositions stuff? Or am I trying to do something that's
fundamentally impossible? Any other good ideas how to achieve this kind of
behaviour?

Thanks
Christian
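The false positive can be reproduced outside Solr. A minimal Python simulation (not Solr code) of the trigram matching shows that every trigram of the query ".100.22" also occurs somewhere in "10.100.0.22", even though the literal substring does not:

```python
def trigrams(s):
    """All 3-character substrings, analogous to minGramSize=3/maxGramSize=3."""
    return [s[i:i + 3] for i in range(len(s) - 2)]

docs = {"test1": "10.0.100.22", "test2": "10.100.0.22"}
query = ".100.22"

for doc_id, line in docs.items():
    grams = set(trigrams(line))
    # mm=100%: all query trigrams must be present, but order is not enforced
    all_present = all(g in grams for g in trigrams(query))
    print(doc_id, all_present, query in line)
# test1 matches and really contains the substring; test2 matches the
# trigram set too, although ".100.22" is not a substring of it.
```

This is why the slop-based proximity query (~5) is not enough: it bounds how far apart the grams may be, but not their exact order and adjacency.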
Re: Multi-Tenant Setup in Single Core
On 11/12/13 5:20 PM, Shawn Heisey wrote:
> Ensure that all handler names start with a slash character, so they are
> things like "/query", "/select", and so on. Make sure that handleSelect
> is set to false on your requestDispatcher config. This is how Solr 4.x
> examples are set up already.
>
> With that config, the "qt" parameter will not function and will be
> ignored -- you must use the request handler path as part of the URL --
> /solr/corename/handler.

Great, thanks. I already had it this way, but I wasn't aware of these fine
details, very helpful.

Christian
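In solrconfig.xml terms, the configuration Shawn describes is a sketch like the following (handler name and defaults are illustrative):

```xml
<!-- With handleSelect="false", a ?qt=... parameter cannot switch handlers;
     each handler is reachable only at its own path, e.g.
     /solr/corename/query -->
<requestDispatcher handleSelect="false"/>

<requestHandler name="/query" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>
```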
Re: Multi-Tenant Setup in Single Core
On 11/12/13 1:51 PM, Erick Erickson wrote:
> When you mention velocity, you're talking about the stock Velocity
> Response Writer that comes with the example? Because if you're exposing
> the Solr http address to the world, accessing each other's data is the
> least of your worries. To wit:
>
> http://machine:8983/solr/collection/update?commit=true&stream.body=
> *:*
>
> Often people put a dedicated app in front of their Solr and secure that,
> then put a firewall between their Solr and the world that only lets
> requests through from the known app machines.

Thanks Erick

Yes, it is the stock velocity writer. But this is an intranet app in a
segmented environment. From the client networks, Solr can only be accessed
via an Apache reverse proxy, and the only URL paths that can be accessed are
the velocity response handlers at

https://reverse-proxy/mapping-to-solr/searchui_client*

All other paths are blocked at the reverse proxy level. So I'm worried about
something that uses these URL paths, say

https://reverse-proxy/mapping-to-solr/searchui_client?qt=update&commit=true&stream.body=*:*

to stay with your example.

Christian

> Best,
> Erick
>
> On Tue, Nov 12, 2013 at 7:09 AM, Christian Ramseyer wrote:
>
>> Hi guys
>>
>> I'm prototyping a multi-tenant search. I have various document sources
>> and a tenant can potentially access subsets of any source. Tenants also
>> have overlapping access to the sources, which is why I'm trying to do
>> it in a single core.
>>
>> I'm doing this by labeling the source (origin, single value) and
>> tagging the individual documents with a list of clients that can
>> access them (required_access_token, array). A tenant then gets a
>> Velocity search handler with an invariant fq like this:
>>
>> class="solr.SearchHandler"
>> (origin:(client1docs OR generaldocs) AND
>> required_access_token:(client1))
>>
>> class="solr.SearchHandler"
>> (origin:(client2docs OR generaldocs) AND
>> required_access_token:(client2))
>>
>> class="solr.SearchHandler"
>> (origin:(client3docs OR generaldocs) AND
>> required_access_token:(client3))
>>
>> Access to the search handler by client is controlled via a reverse
>> proxy, and all the other handlers like /browse or /select are not
>> available.
>>
>> Do you guys see any obvious security problems with this? I'm especially
>> worried about some kind of "SQL injection" into the query field
>> (edismax parser) in the velocity template handler which would allow
>> overriding or adding stuff to the invariant fq, or the ability to
>> select another query handler via URL parameters like
>> /searchui_client1?qt=searchui_client2 or similar.
>>
>> Do you think this setup can be reasonably safe?
>>
>> Thanks
>>
>> Christian
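The requestHandler XML in the quoted message lost its tags in the archive; only the class attribute and the fq values survive. A plausible reconstruction of one per-tenant handler, with the element names following standard solrconfig.xml conventions and the handler name inferred from the URLs mentioned later in the thread:

```xml
<!-- Reconstruction, not the original config. One handler per tenant;
     the invariant fq cannot be overridden by request parameters. -->
<requestHandler name="/searchui_client1" class="solr.SearchHandler">
  <lst name="invariants">
    <str name="fq">(origin:(client1docs OR generaldocs) AND
                    required_access_token:(client1))</str>
  </lst>
  <lst name="defaults">
    <str name="defType">edismax</str>
  </lst>
</requestHandler>
```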
Multi-Tenant Setup in Single Core
Hi guys

I'm prototyping a multi-tenant search. I have various document sources and a
tenant can potentially access subsets of any source. Tenants also have
overlapping access to the sources, which is why I'm trying to do it in a
single core.

I'm doing this by labeling the source (origin, single value) and tagging the
individual documents with a list of clients that can access them
(required_access_token, array). A tenant then gets a Velocity search handler
with an invariant fq like this:

(origin:(client1docs OR generaldocs) AND required_access_token:(client1))
(origin:(client2docs OR generaldocs) AND required_access_token:(client2))
(origin:(client3docs OR generaldocs) AND required_access_token:(client3))

Access to the search handler by client is controlled via a reverse proxy,
and all the other handlers like /browse or /select are not available.

Do you guys see any obvious security problems with this? I'm especially
worried about some kind of "SQL injection" into the query field (edismax
parser) in the velocity template handler which would allow overriding or
adding stuff to the invariant fq, or the ability to select another query
handler via URL parameters like /searchui_client1?qt=searchui_client2 or
similar.

Do you think this setup can be reasonably safe?

Thanks

Christian
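The reverse-proxy restriction described above could be sketched like this, assuming Apache httpd with mod_proxy (host names, paths, and the client network range are illustrative, not from the thread):

```apache
# Only the per-tenant Velocity handler paths are proxied; every other
# Solr path (/select, /browse, /update, ...) is simply not reachable
# from the client networks.
ProxyPass        "/mapping-to-solr/searchui_client1" "http://solr-host:8983/solr/core/searchui_client1"
ProxyPassReverse "/mapping-to-solr/searchui_client1" "http://solr-host:8983/solr/core/searchui_client1"

<Location "/mapping-to-solr">
    Require ip 10.0.0.0/8
</Location>
```

Note that path restriction alone does not neutralize qt-style handler switching; that depends on handleSelect being disabled on the Solr side, as discussed elsewhere in this thread.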