If you can't change the analyzer, you can build the query programmatically:
either a MultiPhraseQuery (you'd have to enumerate the alternative terms at
each position yourself, so not a great option) or a SpanNearQuery composed of
RegexpQueries wrapped in SpanMultiTermQueryWrapper (the rewrites are taken care
of for you).
You might also want to look into the ComplexPhraseQueryParser, with a query
along these lines:
"/5{1}<1-5>{1}<0-9>{2}/ /<0-9>{4}/ /<0-9>{4}/ /<0-9>{4}/"
Make sure to OR that with the regex that captures the number written without
spaces or hyphens, which is indexed as a single token: "5{1}<1-5>{1}<0-9>{14}"
I can't vouch for performance with the above options...
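Whichever path you take, make sure that the MultiTermQuery.RewriteMethod and/or
maxBooleanClauses are set appropriately: a regex such as <0-9>{4} can expand to
as many as 10,000 terms, far more than the default maxBooleanClauses of 1024.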
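
Before running either pattern against the index, it can help to sanity-check
how the digits are split across tokens. The following is only a minimal sketch
in plain java.util.regex, not Lucene code: the class and method names are
hypothetical, Lucene's <1-5> and <0-9> intervals are rewritten as [1-5] and
[0-9], and the space or hyphen between groups is an assumption standing in for
whatever the analyzer splits on.

```java
import java.util.regex.Pattern;

/**
 * Plain java.util.regex stand-in for the two Lucene patterns (not Lucene code).
 * Lucene's <1-5> and <0-9> intervals are written as [1-5] and [0-9] here.
 */
final class MastercardPatternCheck {

    // Mirrors "/5{1}<1-5>{1}<0-9>{2}/ /<0-9>{4}/ /<0-9>{4}/ /<0-9>{4}/":
    // four 4-digit groups. The single space or hyphen between groups is an
    // assumption about the separators the analyzer drops.
    static final Pattern GROUPED =
            Pattern.compile("5[1-5][0-9]{2}([ -][0-9]{4}){3}");

    // Mirrors "5{1}<1-5>{1}<0-9>{14}": the number as one 16-digit token.
    static final Pattern SINGLE_TOKEN = Pattern.compile("5[1-5][0-9]{14}");

    /** True if the whole candidate matches either pattern (the OR of the two). */
    static boolean looksLikeMastercard(String candidate) {
        return GROUPED.matcher(candidate).matches()
                || SINGLE_TOKEN.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "5106792294698422", "5106 7922 9469 8422",
            "5106-7922-9469-8422", "5606 7922 9469 8422"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + looksLikeMastercard(s));
        }
    }
}
```

This only checks the shape of the patterns. It says nothing about how the
queries rewrite or perform against a real index, and it does not handle
surrounding punctuation such as quotes or a trailing period.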
-----Original Message-----
From: Valentin Popov [mailto:[email protected]]
Sent: Monday, December 15, 2014 8:35 AM
To: [email protected]
Subject: Re: multiterm numbers regexp search
Mike, thanks.
The problem is that we can't change the analyzer: the bank needs the search for
more than just card numbers, for compliance, and the existing storage already
holds hundreds of millions of emails. My thinking is to build a multi-term
regexp search query, or a combination of regexp queries with some distance
between them. The main idea is to search for possible combinations of digits,
since they follow a rule: a MasterCard number starts with 5, the second digit
must be between 1 and 5, and the remaining 14 must be digits.
Thanks
> On Dec 15, 2014, at 16:00, Michael Sokolov
> <[email protected]> wrote:
>
> You probably don't want to use StandardAnalyzer: maybe try
> WhitespaceAnalyzer, but you'll need to enhance your regex a little to deal
> with punctuation since WA may give you tokens like:
>
> 5106-7922-9469-8422.
>
> "5106-7922-9469-8422"
>
> etc
>
> -Mike
>
> On 12/15/14 3:45 AM, Valentin Popov wrote:
>> I need to find MasterCard numbers with a regular expression.
>>
>> I’m using Query query = new RegexpQuery(new Term("body",
>> "5{1}<1-5>{1}<0-9>{14}"), RegExp.ALL) to search for numbers in the email
>> body, and StandardAnalyzer is used to index the body. A number like
>> 5106792294698422 is indexed as is, so all such MasterCard numbers show up
>> in the search results, but a number like 5106 7922 9469 8422 is indexed as
>> 4 tokens (5106, 7922, 9469, 8422), and similarly for 5106-7922-9469-8422.
>>
>> Any ideas how to find the sequence of numbers with spaces, dashes, etc.?
>> Maybe a multi-term regexp search query?
>>
>>
>> Regards,
>> Valentin Popov
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
Regards,
Valentin Popov