Thanks, will try.

> On Dec 15, 2014, at 21:02, Allison, Timothy B. <[email protected]> wrote:
>
> If you can't change the analyzer, you can programmatically build a
> MultiPhraseQuery (you'd have to fill in the alternatives ... not a great
> option) or a SpanNearQuery composed of span-wrapped RegexpQueries (rewrites
> are taken care of for you).
>
> You might also want to look into using the ComplexPhraseQueryParser:
>
> "/5{1}<1-5>{1}<0-9>{2}/ /<0-9>{4}/ /<0-9>{4}/ /<0-9>{4}/"
>
> Make sure to "or" that with the regex to capture the "phrase" without
> spaces/hyphens: "5{1}<1-5>{1}<0-9>{14}"
>
> I can't vouch for performance with the above options...
>
> Whichever path you take, make sure that the MultiTermQuery.RewriteMethod
> and/or maxBooleanClauses are set appropriately.
>
> -----Original Message-----
> From: Valentin Popov [mailto:[email protected]]
> Sent: Monday, December 15, 2014 8:35 AM
> To: [email protected]
> Subject: Re: multiterm numbers regexp search
>
> Mike, thanks.
>
> The problem is that we can't change the analyzer: the bank needs search for
> more than card numbers (for compliance), and the existing storage already
> holds hundreds of millions of emails. My thinking is to build a multiterm
> regexp search query, or to search for a combination of regexp queries with
> some distance between them. The main idea is to search for possible
> combinations of digits, since they follow a rule: a Mastercard number starts
> with 5, the second digit must be between 1 and 5, and the other 14 must be
> digits.
>
> Thanks
>
>> On Dec 15, 2014, at 16:00, Michael Sokolov
>> <[email protected]> wrote:
>>
>> You probably don't want to use StandardAnalyzer: maybe try
>> WhitespaceAnalyzer, but you'll need to enhance your regex a little to deal
>> with punctuation since WA may give you tokens like:
>>
>> 5106-7922-9469-8422.
>>
>> "5106-7922-9469-8422"
>>
>> etc
>>
>> -Mike
>>
>> On 12/15/14 3:45 AM, Valentin Popov wrote:
>>> I need to find Mastercard numbers with a regular expression.
>>>
>>> I'm using Query query = new RegexpQuery(new Term("body",
>>> "5{1}<1-5>{1}<0-9>{14}"), RegExp.ALL) to search for numbers in the email
>>> body, and StandardAnalyzer is used for body indexing. So a number like
>>> 5106792294698422 will be indexed as is, and all such Mastercard numbers
>>> will be in the search results, but numbers like 5106 7922 9469 8422 will be
>>> indexed as 4 tokens 5106, 7922, 9469, 8422, and similarly for
>>> 5106-7922-9469-8422.
>>>
>>> Any ideas how to find the sequence of numbers with spaces, dashes etc?
>>> Maybe a multiterm regexp search query?
>>>
>>> Regards,
>>> Valentin Popov
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>
> Regards,
> Valentin Popov
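
For the archive, here is a minimal sketch of the digit rule discussed above (a 5, then a digit 1-5, then 14 more digits, with an optional space or hyphen between the four-digit groups). It uses plain java.util.regex, not Lucene's RegExp syntax, so it is not a replacement for the queries suggested in the thread; it might be useful for checking candidate hits or sample data outside the index. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: checks text against the rule from
// the thread using java.util.regex (whose syntax differs from Lucene's RegExp).
class CardPatternSketch {

    // 5, then 1-5, then two digits, then three more groups of four digits,
    // each optionally preceded by a single space or hyphen. The lookarounds
    // keep the pattern from matching inside a longer run of digits.
    private static final Pattern CARD = Pattern.compile(
        "(?<!\\d)5[1-5]\\d{2}(?:[ -]?\\d{4}){3}(?!\\d)");

    // Returns every candidate found in the text, with separators removed.
    static List<String> findCandidates(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = CARD.matcher(text);
        while (m.find()) {
            hits.add(m.group().replaceAll("[ -]", ""));
        }
        return hits;
    }

    public static void main(String[] args) {
        // The three spellings mentioned in the thread.
        System.out.println(findCandidates(
            "5106792294698422, 5106 7922 9469 8422, 5106-7922-9469-8422."));
    }
}
```

This only checks the digit pattern; it does no checksum validation and says nothing about how the Lucene-side options perform.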

Regards,
Valentin Popov
