I'd play with the timeAllowed option with a full corpus to get a sense
of how painful these queries are. There's also the issue of the impact
of queries like this on other users to consider....
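
For what it's worth, here's a minimal sketch of what I mean by playing with timeAllowed. The host, the core name ("mycore") and the 2000 ms budget are placeholders I made up; timeAllowed itself is the standard Solr request parameter (in milliseconds). When the budget runs out, Solr returns whatever it has collected so far and sets partialResults in the response header. The snippet only builds the URL and checks a response dict; it doesn't send anything.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, time_allowed_ms):
    """Build a Solr /select URL with a time budget (no request is sent)."""
    params = {
        "q": query,
        "timeAllowed": time_allowed_ms,  # budget in milliseconds
        "wt": "json",
    }
    return base_url + "/select?" + urlencode(params)


def is_partial(response_json):
    """True if the response header says the search was cut short."""
    header = response_json.get("responseHeader", {})
    return bool(header.get("partialResults", False))


# Hypothetical host and core name, for illustration only.
url = build_select_url("http://localhost:8983/solr/mycore",
                       "content:/[0-9]{3}/", 2000)
print(url)
```

Running the same regex query with a few different budgets against the full corpus, and logging how often is_partial() comes back true, should give you a feel for how bad things really are.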

Other than that, I think you're on the right path in terms of
supporting some common use-cases with special indexing. Personally,
though, I think you'll have a difficult time getting all of that to
work at both index time and query-parsing time.

Consider in your proposal what happens if you stick
WordDelimiterFilterFactory in the mix. Then you'd be indexing 'd' 'd'
'd' for "\d\d\d". Then there's the issue of pulling the specially
supported patterns out of a complex query. How do you even know the search is a
regex in the first place? I mean someone could have a document that
contains regex clauses and be searching for the literal '[0-9]{2}' ;)
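
To make the WordDelimiterFilterFactory point concrete, here's a rough simulation. It only mimics the "split on non-alphanumeric characters" part of that filter; it is not the real Lucene implementation, and the function name is made up.

```python
import re


def split_on_delimiters(token):
    """Crude stand-in for a word-delimiter filter: split on non-alphanumerics."""
    return re.findall(r"[A-Za-z0-9]+", token)


# The backslashes are delimiters, so the "special" term falls apart.
print(split_on_delimiters(r"\d\d\d"))    # ['d', 'd', 'd']

# A literal regex appearing in a document fares no better.
print(split_on_delimiters("[0-9]{2}"))   # ['0', '9', '2']
```

So any placeholder you pick for the special field has to survive whatever the rest of the analysis chain does to it.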

For the specific cases you mentioned, you're probably better off just
making all digits '1' in your "special" field and training your
technical users accordingly. And/or just supporting a few regex
queries from the UI.
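
A minimal sketch of the "all digits become '1'" idea, assuming you apply the same normalization to the text going into the special field and to the user's query. In Solr you'd presumably do this in the analysis chain (a pattern-replace char filter or similar) rather than in client code; the Python below just shows the effect, and the function name is made up.

```python
import re


def normalize_digits(text):
    """Replace every ASCII digit with '1'."""
    return re.sub(r"[0-9]", "1", text)


# What would end up in the special field:
print(normalize_digits("123 example"))   # 111 example
print(normalize_digits("123-1234"))      # 111-1111
```

A user after "three digits followed by four digits" then searches the special field for the ordinary phrase "111 1111", with no regex involved.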

Actually, the first thing I'd do is just turn your technical users
loose with regexes and no special support except timeAllowed. If
anyone actually winds up _using_ regexes, _then_ build in special
support once you know it's necessary.

Best,
Erick

On Mon, May 23, 2016 at 2:59 AM, Erez Michalak <emicha...@varonis.com> wrote:
> Good points, thanks Erick.
>
> As you guessed, the use case is not in the main flow for the general user, 
> but an advanced flow for a technical one.
>
> Regarding the performance issue, I thought of a few optimizations for some 
> expected expressions I need to support.
> For instance, to work around the digits regex in all my examples from the 
> mail below, I can simply index terms with '\d' instead of every digit (like 
> '\d\d\d' for '123').
> This enables a faster search as follows:
> * search for "\d\d\d" instead of "/[0-9]{3}/"
> * search for "\d\d\d \d\d\d\d" instead of "/[0-9]{3}/ /[0-9]{4}/"
> * search for "\d\d\d example" instead of "/[0-9]{3}/ example"
> Clearly, this approach supports only a very limited set of expressions, at 
> the expense of a larger index.
> For the general case, though, regular expressions may indeed require a full 
> index scan. Seems like all I can do in that case is to warn the user in 
> advance that this may take a (long) while.
>
> Any further ideas on how to reduce the performance hit and survive the bad 
> impact of a full index scan are welcome.
> Erez
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Sunday, May 22, 2016 7:43 PM
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: How to use a regex search within a phrase query?
>
> Erez:
>
> Before going too far down this path, understand that even if you can get this 
> syntax to work, you're going to pay a _very_ significant performance hit if 
> you have any decent size corpus. Conceptually, what happens is that all the 
> terms that the regex matches are made into clauses. So let's take a very 
> simple wildcard case:
>
> field1 has two values f1A and f1B
> field2 has two values, f2A and f2B
>
> The result of asking for "field1:f1? field2:f2?" (as a phrase) is
> "field1:f1A field2:f2A"
> OR
> "field1:f1A field2:f2B"
> OR
> "field1:f1B field2:f2A"
> OR
> "field1:f1B field2:f2B"
>
> which may take quite a while to execute, and that doesn't even include the 
> time that it'll take to enumerate the terms in a field that match your regex, 
> which can get very ugly if your regex is such that it has to examine _every_ 
> term in the field, i.e. the entire terms list for the field for the entire 
> corpus.
>
> This might be an XY problem: what problem are you solving with regexes? Might 
> you be better off constructing better analysis chains?
> The reason I ask is that unless you have technical users, regexes are 
> unlikely to even be used.
>
> FWIW,
> Erick
>
>
> On Sun, May 22, 2016 at 8:19 AM, Erez Michalak <emicha...@varonis.com> wrote:
>> Thank you, Ahmet, for the JIRA reference - it looks really promising and I'll 
>> check it out.
>>
>> Regarding your question - once a piece of text is tokenized, it seems like 
>> there is no way to perform a regex query across term boundaries. The pure 
>> regex works as long as I'm querying for a single term.
>>
>>
>> -----Original Message-----
>> From: Ahmet Arslan [mailto:iori...@yahoo.com]
>> Sent: Sunday, May 22, 2016 4:49 PM
>> To: solr-user@lucene.apache.org; Erez Michalak <emicha...@varonis.com>
>> Subject: Re: How to use a regex search within a phrase query?
>>
>> Hi Erez,
>>
>> I don't think it is possible to combine regex with phrase out-of-the-box.
>> However, there is https://issues.apache.org/jira/browse/LUCENE-5205 for the 
>> task.
>>
>> Can't you define your query in terms of pure regex?
>> something like /[0-9]{3} .* [0-9]{4}/
>>
>> ahmet
>>
>>
>> On Sunday, May 22, 2016 1:37 PM, Erez Michalak <emicha...@varonis.com> wrote:
>> Hey,
>> I'm developing a search application based on Solr 5.3.1, and would like to 
>> add regex search capabilities to it on a specific tokenized text field named 
>> 'content'.
>> Is it possible to combine the default regex syntax with a phrase query 
>> (and, moreover, with a proximity search)? If so, please tell me how.
>>
>> Thanks in advance,
>> Erez Michalak
>>
>> p.s.
>> Maybe the following example will make my question clearer:
>> The query content:/[0-9]{3}/ returns documents with (at least one) 3-digit 
>> token, as expected.
>> However,
>>
>> * the query content:"/[0-9]{3}/ /[0-9]{4}/" doesn't match the contents 
>>   '123-1234' and '123 1234', even though they are tokenized into two tokens 
>>   ('123' and '1234') which individually match each part of the query
>>
>> * the query content:"/[0-9]{3}/ example" doesn't match the content 
>>   '123 example'
>>
>> * even the query content:"/[0-9]{3}/" (the same query that works, but 
>>   surrounded with quotation marks) doesn't return documents with a 3-digit 
>>   token!
>>
>> * etc.
>>
>>
>> ________________________________
>> This email and any attachments thereto may contain private, confidential, 
>> and privileged material for the sole use of the intended recipient. Any 
>> review, copying, or distribution of this email (or any attachments thereto) 
>> by others is strictly prohibited. If you are not the intended recipient, 
>> please contact the sender immediately and permanently delete the original 
>> and any copies of this email and any attachments thereto.
