Re: Exact substring search with ngrams

2015-08-27 Thread Christian Ramseyer
On 26/08/15 18:05, Erick Erickson wrote:
> bq: my dog
> has fleas
> I wouldn't want some variant of "og ha" to match,
> 
> Here's where the mysterious "positionIncrementGap" comes in. If you
> make this field "multiValued",  and index this like this:
> 
> my dog
> has fleas
> 
> 
> then the position of "dog" will be 2 and the position of "has" will be
> 102 assuming
> the positionIncrementGap is the default 100. N.B. I'm not sure whether
> you'll see this in the admin/analysis page or not.
> 
> Anyway, now your example won't match across the two parts unless
> you specify a "slop" up in the 101 range.

Oh that's nifty, thanks!
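
To make sure I've got the arithmetic right, here is a little sketch in
plain Python. It only models the position bookkeeping described above;
it is not how Lucene is actually implemented, and the exact off-by-one
behaviour may differ between versions:

```python
# Model of multiValued position assignment: token positions keep
# counting upwards, and a positionIncrementGap (default 100) is added
# when crossing from one value to the next.

def assign_positions(values, gap=100):
    """Return (token, position) pairs for a multiValued field."""
    positions = []
    pos = 0
    for i, value in enumerate(values):
        for j, token in enumerate(value.split()):
            if i > 0 and j == 0:
                pos += gap  # jump across the value boundary
            else:
                pos += 1
            positions.append((token, pos))
    return positions

tokens = assign_positions(["my dog", "has fleas"])
print(tokens)  # [('my', 1), ('dog', 2), ('has', 102), ('fleas', 103)]

# A phrase like "dog has" now spans positions 2..102, so it only
# matches with a slop of around 100 or more.
```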



Re: Exact substring search with ngrams

2015-08-26 Thread Christian Ramseyer
On 26/08/15 00:24, Erick Erickson wrote:
> Hmmm, this sounds like a nonsensical question, but "what do you mean
> by arbitrary substring"?
> 
> Because if your substrings consist of whole _tokens_, then ngramming
> is totally unnecessary (and gets in the way). Phrase queries with no slop
> fulfill this requirement.
> 
> But let's assume you need to match within tokens, i.e. if the doc
> contains "my dog has fleas", you need to match input like "as fle", in this
> case ngramming is an option.

Yeah the "as fle"-thing is exactly what I want to achieve.

> 
> You have substantially different index and query time chains. The result
> is that the offsets for all the grams at index time are the same; in the
> quick experiment I tried, all were 1. But at query time, each gram had an
> incremented position.
> 
> I'd start by using the query time analysis chain for indexing also. Next, I'd
> try enclosing multiple words in double quotes at query time and go from there.
> What you have now is an anti-pattern: having substantially different index
> and query time analysis chains is not likely to be very predictable unless
> you know _exactly_ what the consequences are.
> 
> The admin/analysis page is your friend; in this case check the
> "verbose" checkbox to see what I mean.

Hmm interesting. I had the additional \R tokenizer in the index chain
because the document can be multiple lines (but the search text is
always a single line), and if the document was

my dog
has fleas

I wouldn't want some variant of "og ha" to match, but I didn't realize
it didn't give me any positions, as you noticed.
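
If I model what you observed on the analysis page, it's clear why a
phrase over grams could never line up. A quick plain-Python sketch of
that idea; the position rules here are just assumed from your
description, not actual Lucene code:

```python
# Hypothetical model of the two analysis chains' position output.

def index_time_positions(grams):
    # filter-style: every gram keeps its source token's position
    return [(g, 1) for g in grams]

def query_time_positions(grams):
    # tokenizer-style: each gram advances the position by one
    return [(g, i + 1) for i, g in enumerate(grams)]

grams = ["enc", "ncr", "cry"]
print(index_time_positions(grams))  # [('enc', 1), ('ncr', 1), ('cry', 1)]
print(query_time_positions(grams))  # [('enc', 1), ('ncr', 2), ('cry', 3)]

# A zero-slop phrase query expects the indexed positions to increase
# the same way the query-side positions do, so these two can never agree.
```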

I'll try to experiment some more, thanks for the hints!

Chris

> 
> Best,
> Erick
> 
> On Tue, Aug 25, 2015 at 3:00 PM, Christian Ramseyer  wrote:
>> Hi
>>
>> I'm trying to build an index for technical documents that basically
>> works like "grep", i.e. the user gives an arbitrary substring somewhere
>> in a line of a document and the exact matches will be returned. I
>> specifically want no stemming etc. and keep all whitespace, parentheses
>> etc. because they might be significant. The only normalization is that
>> the search should be case-insensitive.
>>
>> I tried to achieve this by tokenizing on line breaks, and then building
>> trigrams of the individual lines:
>>
>> <analyzer type="index">
>>   <tokenizer class="solr.PatternTokenizerFactory"
>>     pattern="\R" group="-1"/>
>>   <filter class="solr.LowerCaseFilterFactory"/>
>>   <filter class="solr.NGramFilterFactory"
>>     minGramSize="3" maxGramSize="3"/>
>> </analyzer>
>>
>> <analyzer type="query">
>>   <tokenizer class="solr.NGramTokenizerFactory"
>>     minGramSize="3" maxGramSize="3"/>
>>   <filter class="solr.LowerCaseFilterFactory"/>
>> </analyzer>
>>
>> Then in the search, I use the edismax parser with mm=100%, so given the
>> documents
>>
>>
>> {"id":"test1","content":"
>> encryption
>> 10.0.100.22
>> description
>> "}
>>
>> {"id":"test2","content":"
>> 10.100.0.22
>> description
>> "}
>>
>> and the query content:encryption, this will turn into
>>
>> "parsedquery_toString":
>>
>> "+((content:enc content:ncr content:cry content:ryp
>> content:ypt content:pti content:tio content:ion)~8)",
>>
>> and return only the first document. All fine and dandy. But I have a
>> problem with possible false positives. If the search is e.g.
>>
>> content:.100.22
>>
>> then the generated query will be
>>
>> "parsedquery_toString":
>> "+((content:.10 content:100 content:00. content:0.2 content:.22)~5)",
>>
>> and because all of the tokens are also generated for document test2
>> within a proximity of 5, both documents will wrongly be returned.
>>
>> So somehow I'd need to express the query "content:.10 content:100
>> content:00. content:0.2 content:.22" with *the tokens exactly in this
>> order and nothing in between*. Is this somehow possible, maybe by using
>> the termvectors/termpositions stuff? Or am I trying to do something
>> that's fundamentally impossible? Any other good ideas on how to achieve
>> this kind of behaviour?
>>
>> Thanks
>> Christian
>>
>>
>>



Exact substring search with ngrams

2015-08-25 Thread Christian Ramseyer
Hi

I'm trying to build an index for technical documents that basically
works like "grep", i.e. the user gives an arbitrary substring somewhere
in a line of a document and the exact matches will be returned. I
specifically want no stemming etc. and keep all whitespace, parentheses
etc. because they might be significant. The only normalization is that
the search should be case-insensitive.

I tried to achieve this by tokenizing on line breaks, and then building
trigrams of the individual lines:

<analyzer type="index">
  <tokenizer class="solr.PatternTokenizerFactory"
    pattern="\R" group="-1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.NGramFilterFactory"
    minGramSize="3" maxGramSize="3"/>
</analyzer>

<analyzer type="query">
  <tokenizer class="solr.NGramTokenizerFactory"
    minGramSize="3" maxGramSize="3"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>

Then in the search, I use the edismax parser with mm=100%, so given the
documents


{"id":"test1","content":"
encryption
10.0.100.22
description
"}

{"id":"test2","content":"
10.100.0.22
description
"}

and the query content:encryption, this will turn into

"parsedquery_toString":

"+((content:enc content:ncr content:cry content:ryp
content:ypt content:pti content:tio content:ion)~8)",

and return only the first document. All fine and dandy. But I have a
problem with possible false positives. If the search is e.g.

content:.100.22

then the generated query will be

"parsedquery_toString":
"+((content:.10 content:100 content:00. content:0.2 content:.22)~5)",

and because all of the tokens are also generated for document test2
within a proximity of 5, both documents will wrongly be returned.
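
The false positive is easy to reproduce outside of Solr; a quick
plain-Python sketch (just the trigram splitting, ignoring the
lowercasing):

```python
# Reproduce the false positive: every trigram of the query also occurs
# somewhere in both documents, so a pure bag-of-grams match (mm=100%,
# no ordering) accepts both, even though only test1 contains the
# substring.

def trigrams(s):
    return [s[i:i+3] for i in range(len(s) - 2)]

query = ".100.22"
docs = {"test1": "10.0.100.22", "test2": "10.100.0.22"}

for doc_id, line in docs.items():
    gram_set = set(trigrams(line))
    all_grams_present = all(g in gram_set for g in trigrams(query))
    really_contains = query in line
    print(doc_id, all_grams_present, really_contains)
# test1 True True   -- correct hit
# test2 True False  -- false positive
```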

So somehow I'd need to express the query "content:.10 content:100
content:00. content:0.2 content:.22" with *the tokens exactly in this
order and nothing in between*. Is this somehow possible, maybe by using
the termvectors/termpositions stuff? Or am I trying to do something
that's fundamentally impossible? Any other good ideas on how to achieve
this kind of behaviour?
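
To state precisely what I'm after, here is the condition in plain
Python (illustration only, not a Solr solution): the query's grams
must appear at consecutive positions and in order, which for
character trigrams is equivalent to a substring test.

```python
# The desired semantics: the query's trigrams, exactly in order and
# with nothing in between, must form a consecutive run of the
# document's trigrams. With one gram per character offset this is
# equivalent to a plain substring match.

def trigrams(s):
    return [s[i:i+3] for i in range(len(s) - 2)]

def ordered_gram_match(line, query):
    dg, qg = trigrams(line), trigrams(query)
    n = len(qg)
    # grams must appear at consecutive positions, in order
    return any(dg[i:i + n] == qg for i in range(len(dg) - n + 1))

print(ordered_gram_match("10.0.100.22", ".100.22"))  # True
print(ordered_gram_match("10.100.0.22", ".100.22"))  # False
```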

Thanks
Christian





Re: Multi-Tenant Setup in Single Core

2013-11-13 Thread Christian Ramseyer
On 11/12/13 5:20 PM, Shawn Heisey wrote:
> Ensure that all handler names start with a slash character, so they are
> things like "/query", "/select", and so on.  Make sure that handleSelect
> is set to false on your requestDispatcher config.  This is how Solr 4.x
> examples are set up already.
> 
> With that config, the "qt" parameter will not function and will be
> ignored -- you must use the request handler path as part of the URL --
> /solr/corename/handler.


Great, thanks. I already had it this way, but I wasn't aware of these
fine details; very helpful.
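
For my own notes, the relevant solrconfig.xml shape as I understand it
(a sketch only; the handler name is an example, and the rest follows
the stock 4.x example config as far as I know):

```xml
<!-- qt is ignored when handleSelect is false; handlers are then only
     reachable by their path, e.g. /solr/corename/query -->
<requestDispatcher handleSelect="false">
  <requestParsers enableRemoteStreaming="true"/>
</requestDispatcher>

<requestHandler name="/query" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="wt">json</str>
  </lst>
</requestHandler>
```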

Christian




Re: Multi-Tenant Setup in Single Core

2013-11-12 Thread Christian Ramseyer
On 11/12/13 1:51 PM, Erick Erickson wrote:
> When you mention velocity, you're talking about the stock Velocity Response
> Writer that comes with the example? Because if you're exposing the Solr
> http address to the world, accessing each others data is the least of your
> worries. To wit:
> 
> http://machine:8983/solr/collection/update?commit=true&stream.body=
> <delete><query>*:*</query></delete>
> 
> Often people put a dedicated app in front of their Solr and secure that,
> then put a firewall between their Solr and the world that only lets
> requests through from the known app machines.

Thanks Erick

Yes it is the stock velocity writer. But, this is an intranet app in a
segmented environment. From the client networks, Solr can only be
accessed via an Apache Reverse Proxy, and the only URL paths that can be
accessed are the velocity response handlers at

https://reverse-proxy/mapping-to-solr/searchui_client*

All other paths are blocked at the reverse proxy level.

So I'm worried about something that uses these URL paths, say

https://reverse-proxy/mapping-to-solr/searchui_client?qt=update&
commit=true&stream.body=<delete><query>*:*</query></delete>

to stay with your example.

Christian

> 
> Best,
> Erick
> 
> 
> On Tue, Nov 12, 2013 at 7:09 AM, Christian Ramseyer  wrote:
> 
>> Hi guys
>>
>> I'm prototyping a multi-tenant search. I have various document sources and
>> a tenant can potentially access subsets of any source.
>> Also tenants have overlapping access to the sources, which is why I'm
>> trying to do it in a single core.
>>
>> I'm doing this by labeling the source (origin, single value) and tagging
>> the individual documents with a list of clients that can
>> access it (required_access_token, array). A tenant then gets a Velocity
>> search handler with an invariant fq like this:
>>
>> <requestHandler name="/searchui_client1" class="solr.SearchHandler">
>>   <lst name="invariants">
>>     <str name="fq">(origin:(client1docs OR generaldocs) AND
>>       required_access_token:(client1))</str>
>>   </lst>
>> </requestHandler>
>>
>> <requestHandler name="/searchui_client2" class="solr.SearchHandler">
>>   <lst name="invariants">
>>     <str name="fq">(origin:(client2docs OR generaldocs) AND
>>       required_access_token:(client2))</str>
>>   </lst>
>> </requestHandler>
>>
>> <requestHandler name="/searchui_client3" class="solr.SearchHandler">
>>   <lst name="invariants">
>>     <str name="fq">(origin:(client3docs OR generaldocs) AND
>>       required_access_token:(client3))</str>
>>   </lst>
>> </requestHandler>
>>
>> Access to the search handler by client is controlled via a reverse proxy,
>> and all the other handlers like /browse or /select
>> are not available.
>>
>> Do you guys see any obvious security problems with this? I'm especially
>> worried about some kind of "SQL injection" into the query field
>> (edismax parser) in the velocity template handler which would allow
>> overriding or adding stuff to the invariant fq, or the ability to
>> select another query handler via URL parameters like
>> /searchui_client1?qt=searchui_client2 or similar.
>>
>> Do you think this setup can be reasonably safe?
>>
>> Thanks
>>
>> Christian
>>
> 



Multi-Tenant Setup in Single Core

2013-11-12 Thread Christian Ramseyer
Hi guys

I'm prototyping a multi-tenant search. I have various document sources and a 
tenant can potentially access subsets of any source.
Also tenants have overlapping access to the sources, which is why I'm
trying to do it in a single core.

I'm doing this by labeling the source (origin, single value) and tagging
the individual documents with a list of clients that can
access it (required_access_token, array). A tenant then gets a Velocity
search handler with an invariant fq like this:


<requestHandler name="/searchui_client1" class="solr.SearchHandler">
  <lst name="invariants">
    <str name="fq">(origin:(client1docs OR generaldocs) AND
      required_access_token:(client1))</str>
  </lst>
</requestHandler>

<requestHandler name="/searchui_client2" class="solr.SearchHandler">
  <lst name="invariants">
    <str name="fq">(origin:(client2docs OR generaldocs) AND
      required_access_token:(client2))</str>
  </lst>
</requestHandler>

<requestHandler name="/searchui_client3" class="solr.SearchHandler">
  <lst name="invariants">
    <str name="fq">(origin:(client3docs OR generaldocs) AND
      required_access_token:(client3))</str>
  </lst>
</requestHandler>

Access to the search handler by client is controlled via a reverse proxy, and 
all the other handlers like /browse or /select
are not available.

Do you guys see any obvious security problems with this? I'm especially
worried about some kind of "SQL injection" into the query field
(edismax parser) in the velocity template handler which would allow
overriding or adding stuff to the invariant fq, or the ability to
select another query handler via URL parameters like
/searchui_client1?qt=searchui_client2 or similar.

Do you think this setup can be reasonably safe?

Thanks

Christian