Erez:

Before going too far down this path, understand that even if you can get this
syntax to work, you're going to pay a _very_ significant performance hit on
any decent-sized corpus. Conceptually, every term that the regex matches is
turned into its own clause. So let's take a very simple wildcard case:

field1 has two values, f1A and f1B
field2 has two values, f2A and f2B

The result of asking for "field1:f1? field2:f2?" (as a phrase) is
"field1:f1A field2:f2A"
OR
"field1:f1A field2:f2B"
OR
"field1:f1B field2:f2A"
OR
"field1:f1B field2:f2B"

which may take quite a while to execute, and that doesn't even
include the time that it'll take to enumerate the terms in a field that
match your regex, which can get very ugly if your regex is such that
it has to examine _every_ term in the field, i.e. the entire terms list
for the field for the entire corpus.
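
To make the expansion concrete, here is a rough Python sketch of the idea. It
is not how Lucene is implemented; the terms_by_field dictionary and the
expand/expand_phrase helpers are made up purely for illustration, and the
"?" wildcard is matched with Python's fnmatch rather than a real regex engine.

```python
import fnmatch
from itertools import product

# Hypothetical per-field term lists, mirroring the two-field example above.
terms_by_field = {
    "field1": ["f1A", "f1B"],
    "field2": ["f2A", "f2B"],
}

def expand(field, pattern):
    """Scan every term in the field and keep the ones the wildcard matches."""
    return [t for t in terms_by_field[field] if fnmatch.fnmatchcase(t, pattern)]

def expand_phrase(parts):
    """Build one exact phrase per combination of matching terms."""
    per_position = [
        [f"{field}:{term}" for term in expand(field, pattern)]
        for field, pattern in parts
    ]
    return [" ".join(combo) for combo in product(*per_position)]

phrases = expand_phrase([("field1", "f1?"), ("field2", "f2?")])
for phrase in phrases:
    print(f'"{phrase}"')
```

The number of phrases is the product of the match counts at each position, so
two patterns that each match 1,000 terms would already mean 1,000,000 phrase
clauses.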

This might be an XY problem: what problem are you solving with
regexes? Might you be better off constructing better analysis chains?
The reason I ask is that unless you have technical users, regexes are
unlikely to even be used....

FWIW,
Erick


On Sun, May 22, 2016 at 8:19 AM, Erez Michalak <emicha...@varonis.com> wrote:
> Thank you, Ahmet, for the JIRA reference - it looks really promising and I'll
> check it out.
>
> Regarding your question - once a piece of text is tokenized, it seems like 
> there is no way to perform a regex query across term boundaries. The pure 
> regex is good as long as I'm querying for a single term.
>
>
> -----Original Message-----
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: Sunday, May 22, 2016 4:49 PM
> To: solr-user@lucene.apache.org; Erez Michalak <emicha...@varonis.com>
> Subject: Re: How to use a regex search within a phrase query?
>
> Hi Erez,
>
> I don't think it is possible to combine regex with phrase out-of-the-box.
> However, there is https://issues.apache.org/jira/browse/LUCENE-5205 for the 
> task.
>
> Can't you define your query in terms of pure regex?
> something like /[0-9]{3} .* [0-9]{4}/
>
> ahmet
>
>
> On Sunday, May 22, 2016 1:37 PM, Erez Michalak <emicha...@varonis.com> wrote:
> Hey,
> I'm developing a search application based on SOLR 5.3.1, and would like to 
> add to it regex search capabilities on a specific tokenized text field named 
> 'content'.
> Is it possible to combine the default regex syntax within a phrase query (and 
> moreover, within a proximity search)? If so, please instruct me how..
>
> Thanks in advance,
> Erez Michalak
>
> p.s.
> Maybe the following example will make my question clearer:
> The query content:/[0-9]{3}/ returns documents with (at least one) 3 digits 
> token as expected.
> However,
>
> * the query content:"/[0-9]{3}/ /[0-9]{4}/" doesn't match the contents
>   '123-1234' and '123 1234', even though they are tokenized to two tokens
>   ('123' and '1234') which individually match each part of the query
>
> * the query content:"/[0-9]{3}/ example" doesn't match the content
>   '123 example'
>
> * even the query content:"/[0-9]{3}/" (same as the query that works but
>   surrounded with quotation marks) doesn't return documents with 3 digits
>   token!
>
> * etc.
>
>
> ________________________________
> This email and any attachments thereto may contain private, confidential, and 
> privileged material for the sole use of the intended recipient. Any review, 
> copying, or distribution of this email (or any attachments thereto) by others 
> is strictly prohibited. If you are not the intended recipient, please contact 
> the sender immediately and permanently delete the original and any copies of 
> this email and any attachments thereto.
