First of all, to say that the problem is sitting in front of the monitor is *insulting my keyboard*. He sits between me and the monitor and has done nothing wrong. The problem is actually my favorite phrase: PEBKAC, the *"Problem Exists Between Keyboard And Chair."*

I believe that I have this working, using <<< >>> to turn off the optimizer for the single line, but it can surely use some review. More on that later.

I've re-read the PCRE documentation at https://www.pcre.org/original/doc/html/index.html, focusing for now on the pcrepattern page. I see at least some of the errors in my test script. I still don't know why the \1 isn't working, but I've moved on to named references, which I didn't know existed until now.

You're going to laugh hysterically at me when you realize what I'm really trying to accomplish and how much more complicated it is than the silly test. But I think I'm getting close. This has been a very good lesson, and it will hopefully be beneficial to others by example.

Ultimately, I need to match (and give a negative score for the match, to help that message get through) any email where the username portion of the TO field matches the second-level domain name (one label to the left of the TLD) of the FROM address. That's clearly a far more complicated query.

*Why in the world would I want to match such a thing?*

They've had me set up a handful of wildcard subdomains. Essentially, any message sent to <anything>@zackary.ourcharity.org, for example, will actually go to the zack...@ourcharity.org mailbox. That works fine. There are about 20 people with their own subdomain. For tracking purposes on one of our projects, Zackary and other staff are having people from the organizations that we help email him at theirsecondleveldomainn...@zackary.ourcharity.org. Those messages would usually come from outside.per...@theirsecondleveldomainname.org or outside.per...@subdomain.theirsecondleveldomainname.org, but could also come from a person's personal email like Gmail or Outlook. The program people can then search/sort by TO and gather all of the messages related to that outside org. Fine, it's a weird way of working, but it's what the powers that be decided on.

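To pin down the rule described above independently of any regex, here is a minimal Python sketch. It is only an illustration of the comparison, not something ASSP runs: the function name and the example addresses are made up, and it naively treats the label just left of the last dot as the "second level", so suffixes like co.uk are not handled.

```python
from email.utils import parseaddr


def to_user_matches_from_domain(to_header: str, from_header: str) -> bool:
    """Return True when the TO local part equals the FROM second-level label."""
    _, to_addr = parseaddr(to_header)
    _, from_addr = parseaddr(from_header)
    if "@" not in to_addr or "@" not in from_addr:
        return False
    to_user = to_addr.rsplit("@", 1)[0].lower()
    from_labels = from_addr.rsplit("@", 1)[1].lower().split(".")
    if len(from_labels) < 2:
        return False
    # Naive assumption: the label just left of the last dot is the
    # "second level"; multi-part suffixes such as co.uk are not handled.
    return to_user == from_labels[-2]


# Hypothetical addresses, made up for illustration only.
print(to_user_matches_from_domain(
    "Zack <theirorg@zackary.ourcharity.example>",
    "Jo <jo@mail.theirorg.example>",
))
```

A message from a personal mailbox (say jo@gmail.example) would return False here, which matches the limitation already noted: the rule only helps mail sent from the organization's own domain.
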
This addressing scheme helps with reporting and helps us get the funding to continue helping these people. (Every once in a while I'm reminded that the IT work I do actually is for a good cause and really does help people, despite my frustrations with a silly low IT budget.)

The problem is that I'll never know all of the organizations that they're giving these addresses out to, and it can be hundreds of different inbound addresses, so there's nothing I can do in advance. They've been doing this for a while, and we're seeing compromised outside email accounts causing these organization-unique addresses to get on spam lists.

SO, ultimately what I want to do is negatively score any email where the user part of the to address matches the second level of the domain name of the from address. That won't help the legitimate messages from people's personal Gmail get through, but it should help us ensure that messages from each org that are sent to a matching to: address get through, even if what they're talking about might fail Bayesian tests.

Before I show what I came up with, I've discovered an oddity. In BombHeaderRe, testing the following:

    (MatchThisWord)=>-19

will show a score of negative 19 in the analyze GUI when a match is made, as expected. However, doing any of these:

    <<<(MatchThisWord)>>>=>-19
    <<<MatchThisWord>>>=>-19
    ~<<<MatchThisWord>>>~=>-19    <<-- which I tried because I originally had an or in there

will show me a score of positive 25, the bomb valence I have set.

*It seems like using <<< >>> to turn off regex optimization might break the 2nd parameter from being recognized. Am I doing something very wrong or is this a bug?*

And now for the regex I've come up with as a new starting point:

*DISCLAIMER TO ANYONE READING THIS IN THE FUTURE* - while this seems to work for me, it's surely at least imperfect, if not horribly inefficient or even wrong or broken!!!

Here's the regex I've built.
It seems to work in ASSP and tests properly at https://regextr.com with PCRE selected as the engine and the case-insensitive flag ticked:

(?:^|\r?\n)(?:to:(?:.*?[\s\<])*?(?<TOFirstMatch>[a-z\d\-]+)\@(?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n(?:.+\r?\n)*?from:.*?\@(?:[a-z\d\-]+\.)*?\g{TOFirstMatch}\.[a-z]{2,6}\>?\r?\n|from:.*?\@(?:[a-z\d\-]+\.)*?(?<FROMFirstMatch>[a-z\d\-]+)\.[a-z]{2,6}\>?\r?\n(?:.+\r?\n)*?to:(?:.*?[\s\<])*?\g{FROMFirstMatch}\@(?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n)

This appears to match:

    x-whatever: bla bla
    to: "my name" <*ThirdParty2Level-Domain*@OurCharity.org>
    subject: testing
    from: "them" <whatever.e...@bla.bla.*ThirdParty2Level-Domain*.them>
    asdf

and the same headers with the from appearing before the to.

I do not know of a way to make the order of to and from insignificant, so I've put an "or" between the first part of the regex, which looks for to then from, and the second part, which looks for from then to. *Would it be more efficient for ASSP to have 2 separate lines, one for to first and the other for from first?*

Here's my thinking and an explanation of my understanding of the regex that I wrote. I am VERY interested in corrections and suggestions for improvement, especially relating to efficiency (and obviously flawed logic and/or cases where what I've done would or wouldn't match as I'm thinking). Guidance here won't only help me perfect this specific regex for ASSP use, but will hopefully help others looking for help with more-complex-than-typical regexes in ASSP.

I'll definitely be limiting the to domains to those that we use here, to speed this up a bit, but I kept it more generic here. I also tried to see a way where lookaheads might help, but I'm not quite there yet.... Would they be helpful here?

Starting from the beginning:

(?:^|\r?\n)
    start with either the start of the string or a \r?\n - sometimes there's a \r, but always a \n. Is \r?\n recommended? Is there a better way?

Then we're going to do 2 big ORs, first looking for to then from, then from then to.
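As a side check on the pattern and sample headers above, here is a sketch that exercises the same expression in Python. Several things here are assumptions rather than anything ASSP does: Python's `re` module spells named groups `(?P<name>...)` and named back-references `(?P=name)` instead of `(?<name>...)` and `\g{name}`, so the pattern is a translation; the group names, the `find_match` helper and the `someone@` sender are made up for illustration; and it does not model whatever flags ASSP wraps around the pattern at runtime.

```python
import re

# Python translation of the PCRE pattern; group names are hypothetical.
PATTERN = re.compile(
    r"(?:^|\r?\n)(?:"
    # to: first, then from:
    r"to:(?:.*?[\s<])*?(?P<to_first>[a-z\d-]+)@(?:[a-z\d-]+\.)+[a-z]{2,6}>?\r?\n"
    r"(?:.+\r?\n)*?"
    r"from:.*?@(?:[a-z\d-]+\.)*?(?P=to_first)\.[a-z]{2,6}>?\r?\n"
    r"|"
    # from: first, then to:
    r"from:.*?@(?:[a-z\d-]+\.)*?(?P<from_first>[a-z\d-]+)\.[a-z]{2,6}>?\r?\n"
    r"(?:.+\r?\n)*?"
    r"to:(?:.*?[\s<])*?(?P=from_first)@(?:[a-z\d-]+\.)+[a-z]{2,6}>?\r?\n"
    r")",
    re.IGNORECASE,
)


def find_match(headers: str):
    """Return the name shared by the to: user and from: domain, or None."""
    m = PATTERN.search(headers)
    if m is None:
        return None
    return m.group("to_first") or m.group("from_first")


# Sample headers modelled on the example above; the sender is hypothetical.
SAMPLE = (
    "x-whatever: bla bla\n"
    'to: "my name" <ThirdParty2Level-Domain@OurCharity.org>\n'
    "subject: testing\n"
    'from: "them" <someone@bla.bla.ThirdParty2Level-Domain.them>\n'
    "asdf\n"
)

print(find_match(SAMPLE))
```

One thing a harness like this makes easy to probe: `[\s\<]` also matches a newline, so the `to:` part of the pattern is not confined to the to: line and can keep scanning into later header lines.
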

(?:
    starts this big or, with the ?: indicating that it's a non-capturing group

The TO then From part is this:

to:(?:.*?[\s\<])*?(?<TOFirstMatch>[a-z\d\-]+)\@(?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n(?:.+\r?\n)*?from:.*?\@(?:[a-z\d\-]+\.)*?\g{TOFirstMatch}\.[a-z]{2,6}\>?\r?\n

broken out:

to:
    find to: immediately after the previously found newline (or start of string)
(?:.*?[\s\<])*?
    non-capturing match for any characters, repeated as long as they end with a space or <; now we should be at the point where the username starts
(?<TOFirstMatch>[a-z\d\-]+)\@
    get a named match called TOFirstMatch for any combination of a-z, digits and - that ends in the (now escaped) @
(?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n
    then just make sure that what follows the @ is a-z, digits and dashes, each part ending in a ., with a 2-6 letter TLD ending the hostname, followed by an optional > and then \r?\n (an optional \r, then \n) to end the line
(?:.+\r?\n)*?
    then skip following lines which aren't blank until we reach a line starting with from:
from:.*?
    line starts with from:, followed by any characters
\@(?:[a-z\d\-]+\.)*?
    find the @valid.sub. part of the from address
\g{TOFirstMatch}
    use the \g{} syntax to match the named backreference
\.[a-z]{2,6}\>?\r?\n
    immediately followed by a .tld 2-6 characters in length, an optional >, and \r?\n
|
    then an OR, and we do the whole thing again but with From first

from:.*?\@(?:[a-z\d\-]+\.)*?(?<FROMFirstMatch>[a-z\d\-]+)\.[a-z]{2,6}\>?\r?\n(?:.+\r?\n)*?to:(?:.*?[\s\<])*?\g{FROMFirstMatch}\@(?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n

from:.*?
    from: followed by anything until we hit
\@(?:[a-z\d\-]+\.)*?
    an @ sign followed by any number of hostname labels, each followed by a .
(?<FROMFirstMatch>[a-z\d\-]+)
    find the second-level domain name and call it FROMFirstMatch
\.[a-z]{2,6}\>?\r?\n
    followed by a .tld of 2 to 6 characters, an optional closing >, and \r?\n
(?:.+\r?\n)*?
    move past non-blank lines until we hit
to:(?:.*?[\s\<])*?
    to: optionally followed by whatever characters, ending in a space or <
\g{FROMFirstMatch}\@
    now look for the second-level domain match from the from: line, immediately followed by an @ sign
(?:[a-z\d\-]+\.)+
    then hostnames separated by dots, at least 1
[a-z]{2,6}\>?\r?\n
    followed by a 2-6 character TLD, an optional >, and \r?\n
)
    closing out the or between the TOFirstMatch and FROMFirstMatch sections

Whew.

On Thu, Nov 4, 2021 at 4:53 AM Thomas Eckardt <thomas.ecka...@thockar.com> wrote:

> forgot to say:
>
> if assp requires to capture the match for a regex, the code would be for
> example
>
> $string =~ /($testReRE)/
> $match = $1;
>
> so - at runtime the regex is
>
> ((?^u:(?is:(?:^|\n\r).*(searchstring).*@.*\1.*)))
>
> IMHO you need to use named capture groups or \g or (?|
>
> Thomas
>
>
>
> From: "Thomas Eckardt" <thomas.ecka...@thockar.com>
> To: "ASSP development mailing list" <assp-test@lists.sourceforge.net>
> Date: 04.11.2021 09:22
> Subject: Re: [Assp-test] RegEx Backreferences - the basics
> ------------------------------
>
> to make backreferences working, regex optimization must be switched off
> for the complete regex -> tested -> worked
>
> >I've seen posts here indicating that backreferencing matches is possible
> with an unoptimized expression.
>
> so - the problem is sitting in front of the monitor :):)
>
> m/(?is:(?:^|\n\r).*(?:searchstring).*@.*\1 <-- HERE .*)/
>
> optimized - default is : 'no extra group capturing is allowed'
>
> >I've got to be missing something incredibly obvious.
>
> assp-do-not-optimize-regex
>
> > (?:^|\n\r).*(searchstring).*@.*\1.*
>
> assp makes it:
>
> (?is:(?:^|\n\r).*(searchstring).*@.*\1.*)
>
> think about your regex - read it from left to right as 'perl regex engine'
> - what will happen?
> beside the other mistakes the @ should be escaped \@ , because an ARRAY
> @. may exist
>
> >Regex101.com seems to confirm that this works.
>
> does not check perl pcre
>
> and if I read the explanation there, I sure it will not work like you
> expect
>
>
> Thomas
>
>
>
> From: "K Post" <nntp.p...@gmail.com>
> To: "ASSP development mailing list" <assp-test@lists.sourceforge.net>
> Date: 04.11.2021 02:29
> Subject: [Assp-test] RegEx Backreferences - the basics
> ------------------------------
>
> I've got nothing in my TestRe file except for a single line:
>
> ~<<<(?:^|\n\r).*(searchstring).*@.*\1.*>>>~
>
> The idea is to log any time there's a line that includes "searchstring" on
> the right and left of an @. This is just a very rudimentary test because
> backreferences seem to error for me. I would expect this to match
> searchstring@searchstring
> something else searchstring more @ whatever searchstring bla
> If "searchstring" is to the right and left of an @ sign, it should match.
> Regex101.com seems to confirm that this works. Like I said, super basic.
>
> However, if I enter ~<<<(?:^|\n\r).*(searchstring).*@.*\1.*>>>~ as the
> only line in TestRe file, I get a warning in the log:
>
> - Reference to nonexistent group in regex; marked by <-- HERE in
> m/(?is:(?:^|\n\r).*(?:searchstring).*@.*\1 <-- HERE .*)/
> - try using unoptimized regex
>
> To my understanding, the <<< >>> surround should turn off regex
> optimization for that line, which enables backreferencing (\1) to work and
> the ~ is required because there's an or in there. Shouldn't the \1
> reference (searchstring) ? I don't understand why assp thinks that \1 is a
> reference to a non-existent group.
>
> I also tried removing the <<< >>> and adding assp-do-not-optimize to the
> top of the TestRe file. No difference. No matter how simple I make the
> regex, even (.*)@\1, it still complains about the invalid backreference.
>
>
> I've got to be missing something incredibly obvious.
> I've read through the regex doc in docs, but that doesn't talk about
> backreferencing in ASSP and I can't find anything in the GUI that makes
> mention. I've seen posts here indicating that backreferencing matches is
> possible with an unoptimized expression.
>
> A shove in the right direction would be greatly appreciated.
> _______________________________________________
> Assp-test mailing list
> Assp-test@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/assp-test
>
>
> DISCLAIMER:
> *******************************************************
> This email and any files transmitted with it may be confidential, legally
> privileged and protected in law and are intended solely for the use of the
> individual to whom it is addressed.
> This email was multiple times scanned for viruses. There should be no
> known virus in this email!
> *******************************************************