Re: [Assp-test] RegEx Backreferences - the basics

K Post Thu, 04 Nov 2021 16:19:15 -0700

First of all, to say that the problem is sitting in front of the monitor
is *insulting my keyboard*.  He sits between me and the monitor and has
done nothing wrong.  The problem is actually my favorite phrase: PEBKAC.
The *"Problem exists between keyboard and chair."*

I believe that I have this working, using <<<  >>> to turn off the
optimizer for the single line.  But it can surely use some review. More on
that later.

I've re-read the PCRE documentation at
https://www.pcre.org/original/doc/html/index.html, focusing for now on the
pcrepattern info.  I see at least some of the errors in my test script. I
still don't know why the \1 isn't working, but I've moved on to using
named references (an ability previously unknown to me).

You're going to laugh hysterically at me when you realize what I'm really
trying to accomplish and how much more complicated it is than the silly
test. But I think I'm getting close.  This has been a very good lesson, and
it will hopefully be beneficial to others by example.

Ultimately, I need to match (and give a negative score for the match to
help let that message get through) any email where the username portion of
the TO field matches the second level domain name (1 to the left of the
TLD) of the FROM address.  That's clearly a far more complicated query.

*Why in the world would I want to match such a thing?*
They've had me set up a handful of wildcard subdomains.  Essentially any
message sent to <anything>@zackary.ourcharity.org for example will actually
go to the zack...@ourcharity.org mailbox.  That works fine.  There's about
20 people with their own subdomain.

For tracking purposes for one of our projects, Zackary and other staff are
having people who are part of organizations that we help email him by using
theirsecondleveldomainn...@zackary.ourcharity.org.   Those messages usually
would come from outside.per...@theirsecondleveldomainname.org or
outside.per...@subdomain.theirsecondleveldomainname.org, but could also
come from a person's personal email like Gmail/Outlook.

The program people can then search/sort by TO and gather all of the
messages related to that outside org.  Fine, it's a weird way of working,
but it's what the powers that be decided on.  This helps with reporting and
helps us to get the funding to continue helping these people.  (Every once
in a while I'm reminded that the IT work I do actually is for a good cause
and really does help people, despite my frustrations with a silly low IT
budget.)

The problem is that I'll never know all of the organizations that they're
giving out these addresses to, and it can be hundreds of different inbound
addresses, so there's nothing I can do in advance.   They've been doing
this for a while, and we're seeing outside compromised email accounts
causing these organization-unique addresses to get on spam lists.

SO, ultimately what I want to do is negatively score any email where the
user part of the To address matches the second level of the From domain
name.  That won't help the legitimate messages from people's personal Gmail
get through, but it should help ensure that messages from each org that
are sent to a matching To: address get through, even if what they're
talking about might fail Bayesian tests.
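
Stripped of the regex, the comparison I'm after can be sketched in a few lines of Python. This is only an illustration of the rule, not how ASSP evaluates anything. The function names are made up, and it treats the last label as the TLD, so something like .co.uk would be handled wrongly (the regex further down has the same limitation).

```python
def second_level_label(domain: str) -> str:
    """Return the label just left of the TLD, e.g. 'example' for 'mail.example.org'."""
    labels = domain.lower().rstrip(".").split(".")
    return labels[-2] if len(labels) >= 2 else ""


def to_user_matches_from_domain(to_addr: str, from_addr: str) -> bool:
    """True when the To user part equals the From second-level domain label."""
    to_user, to_sep, _ = to_addr.partition("@")
    _, from_sep, from_domain = from_addr.rpartition("@")
    if not to_sep or not from_sep or not to_user:
        return False  # one of the inputs is not a plain user@host address
    return to_user.lower() == second_level_label(from_domain)


# Hypothetical addresses, for illustration only.
print(to_user_matches_from_domain("acme@zackary.ourcharity.org", "jane@mail.acme.org"))
print(to_user_matches_from_domain("acme@zackary.ourcharity.org", "jane@gmail.com"))
```

The first call is the case that should get the negative score; the second is the personal-Gmail case that this rule cannot help with.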

Before I show what I came up with, I've discovered an oddity.  In
BombHeaderRe, testing the following:

(MatchThisWord)=>-19

will show a negative 19 score when a match is made in the analyze GUI, as
expected.

However, doing any of these

<<<(MatchThisWord)>>>=>-19
<<<MatchThisWord>>>=>-19
~<<<MatchThisWord>>>~=>-19   <<-- which I tried because I originally had an
or in there

will show me a score of positive 25, the bomb valence I have set.

*It seems like using <<< >>> to turn off regex optimization might keep the
2nd parameter (the score) from being recognized.   Am I doing something very
wrong or is this a bug?*

And now for the regex I've come up with as a new starting point:

*DISCLAIMER TO ANYONE READING THIS IN THE FUTURE* - while this seems to
work for me, it's surely at least imperfect if not horribly inefficient or
even wrong or broken!!!

Here's the regex I've built.  It seems to work in ASSP and to test properly
at https://regextr.com with PCRE selected as the engine and the case
insensitive flag ticked:

(?:^|\r?\n)(?:to:(?:.*?[\s\<])*?(?<TOFirstMatch>[a-z\d\-]+)\@(?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n(?:.+\r?\n)*?from:.*?\@(?:[a-z\d\-]+\.)*?\g{TOFirstMatch}\.[a-z]{2,6}\>?\r?\n|from:.*?\@(?:[a-z\d\-]+\.)*?(?<FROMFirstMatch>[a-z\d\-]+)\.[a-z]{2,6}\>?\r?\n(?:.+\r?\n)*?to:(?:.*?[\s\<])*?\g{FROMFirstMatch}\@(?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n)

This appears to match:

x-whatever: bla bla
to: "my name" <*ThirdParty2Level-Domain*@OurCharity.org>
subject: testing
from: "them" <whatever.e...@bla.bla.*ThirdParty2Level-Domain*.them>
asdf

and also when the from: line appears before the to: line.
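
For anyone who wants to poke at this outside ASSP, here is a rough Python port of the same expression, run against the sample above in both orders. Treat it as a sketch: Python's re module spells named groups (?P<name>...) and named backreferences (?P=name) instead of (?<name>...) and \g{name}, and only the case-insensitive flag is set, as in my online test. The "whatever.person" user part is a stand-in I made up for the test data.

```python
import re

# Same expression as above; only the named-group syntax is changed for Python.
HEADER_RE = re.compile(
    r"(?:^|\r?\n)(?:"
    # to: first, then from:
    r"to:(?:.*?[\s\<])*?(?P<TOFirstMatch>[a-z\d\-]+)\@(?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n"
    r"(?:.+\r?\n)*?"
    r"from:.*?\@(?:[a-z\d\-]+\.)*?(?P=TOFirstMatch)\.[a-z]{2,6}\>?\r?\n"
    r"|"
    # from: first, then to:
    r"from:.*?\@(?:[a-z\d\-]+\.)*?(?P<FROMFirstMatch>[a-z\d\-]+)\.[a-z]{2,6}\>?\r?\n"
    r"(?:.+\r?\n)*?"
    r"to:(?:.*?[\s\<])*?(?P=FROMFirstMatch)\@(?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n"
    r")",
    re.IGNORECASE,
)

TO_LINE = 'to: "my name" <ThirdParty2Level-Domain@OurCharity.org>\r\n'
FROM_LINE = 'from: "them" <whatever.person@bla.bla.ThirdParty2Level-Domain.them>\r\n'

to_first = "x-whatever: bla bla\r\n" + TO_LINE + "subject: testing\r\n" + FROM_LINE + "asdf\r\n"
from_first = "x-whatever: bla bla\r\n" + FROM_LINE + "subject: testing\r\n" + TO_LINE + "asdf\r\n"
unrelated = TO_LINE + 'from: "someone" <someone@gmail.com>\r\n'

for label, headers in (("to first", to_first), ("from first", from_first), ("unrelated", unrelated)):
    m = HEADER_RE.search(headers)
    captured = (m.group("TOFirstMatch") or m.group("FROMFirstMatch")) if m else None
    print(label, "->", captured)
```

Keep in mind that, per Thomas's note quoted below, ASSP wraps the expression in (?is: ... ), so inside ASSP the dot can also match line breaks. This port does not reproduce that.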

I do not know of a way to make the order of to and from insignificant, so
I've done an "or" in between the first part of the regex which looks for to
then from and the second part which looks for from then to.   *Would it be
more efficient for ASSP to have 2 separate lines, one for to first the
other for from first?*

Here's my thinking and explanation of my understanding of the regex that I
wrote. I am VERY interested in corrections and suggestions for improvement,
especially relating to efficiency (and obviously flawed logic and/or cases
where what I've done would or wouldn't match as I'm thinking).  Guidance
here won't only help me perfect this specific regex for ASSP use, but will
hopefully help others looking for other more complex than typical regex
help with ASSP.  I'll definitely be limiting the to domains to those that
we use here to speed this up a bit, but I kept it more generic here.

I also tried to see a way where lookaheads might help, but I'm not quite
there yet....  Would they be helpful here?
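
One way lookaheads might remove the need for the two-way OR is sketched below in Python. It's an untested-in-ASSP idea, not a recommendation. Both lookaheads start from the same spot (assumed here to be the very top of the header text), the first one captures the to: user part wherever that line is, and the second one looks for a from: line containing it. Python spells the named group (?P<User>...) and the reference (?P=User); in PCRE that would be (?<User>...) and \g{User}. The group name and the sample addresses are invented for the example.

```python
import re

ANY_ORDER_RE = re.compile(
    r"\A"
    # Lookahead 1: skip whole lines until a to: line, capture its user part.
    r"(?=(?:.+\r?\n)*?to:(?:.*?[\s\<])*?(?P<User>[a-z\d\-]+)\@(?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n)"
    # Lookahead 2: from the same start, skip whole lines until a from: line
    # whose second-level domain label equals the captured user part.
    r"(?=(?:.+\r?\n)*?from:.*?\@(?:[a-z\d\-]+\.)*?(?P=User)\.[a-z]{2,6}\>?\r?\n)",
    re.IGNORECASE,
)

TO_LINE = "to: <acme-widgets@zackary.ourcharity.org>\r\n"
FROM_LINE = "from: <jane@mail.acme-widgets.org>\r\n"

to_first = "x-whatever: bla\r\n" + TO_LINE + "subject: testing\r\n" + FROM_LINE
from_first = "x-whatever: bla\r\n" + FROM_LINE + "subject: testing\r\n" + TO_LINE
unrelated = TO_LINE + "from: <jane@gmail.com>\r\n"

for label, headers in (("to first", to_first), ("from first", from_first), ("unrelated", unrelated)):
    m = ANY_ORDER_RE.search(headers)
    print(label, "->", m.group("User") if m else None)
```

A caveat with this approach: once a lookahead has succeeded, the engine does not go back into it, so only the first to: candidate it finds is ever compared against the from: line. With several recipients on the to: line, that may not be the one you want.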

Starting from the beginning:

(?:^|\r?\n)
start with either the start of the string or a \r?\n  (sometimes there's
a \r, but always a \n).  Is \r?\n recommended?  Is there a better way?
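
As a point of comparison, here is a small Python sketch of the two usual ways to anchor at the start of a line: the consuming (?:^|\r?\n) used here, and a plain ^ in multiline mode, which matches right after every \n without consuming it. In Perl/PCRE the multiline mode would be an inline (?m). Whether that is usable once ASSP has wrapped the expression is something I have not verified. The helper name and sample addresses are made up.

```python
import re

# Variant used in this regex: the line break in front of "to:" is part of the match.
CONSUMING = re.compile(r"(?:^|\r?\n)to:", re.IGNORECASE)
# Alternative: in multiline mode ^ is zero-width and matches after every \n.
MULTILINE = re.compile(r"^to:", re.IGNORECASE | re.MULTILINE)


def matched_text(pattern, text):
    """Return the matched substring, or None when there is no match."""
    m = pattern.search(text)
    return m.group(0) if m else None


samples = [
    "to: a@example.org\r\n",                 # to: on the first line
    "subject: hi\r\nto: a@example.org\r\n",  # CRLF line endings
    "subject: hi\nto: a@example.org\n",      # bare LF line endings
    "reply-to: a@example.org\r\n",           # "to:" in the middle of a line
]
for text in samples:
    print(repr(matched_text(CONSUMING, text)), repr(matched_text(MULTILINE, text)))
```

Both variants find to: only at the start of a line, with either kind of line ending, and both ignore reply-to:. The only difference is whether the preceding line break ends up inside the match.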

Then we're going to do 2 big OR's,  first looking for to then from, then
from then to.
(?: starts this big or, with the ?: indicating that it's a non-capturing
group

The TO then From part is this:
to:(?:.*?[\s\<])*?(?<TOFirstMatch>[a-z\d\-]+)\@(?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n(?:.+\r?\n)*?from:.*?\@(?:[a-z\d\-]+\.)*?\g{TOFirstMatch}\.[a-z]{2,6}\>?\r?\n

broken out

to:
find to: immediately after the previously found newline or start of
string

(?:.*?[\s\<])*?
non-capturing match for any characters repeated as long as they end with a
space or <

now we should be at the point where the username starts

(?<TOFirstMatch>[a-z\d\-]+)\@
get a named match called TOFirstMatch for any combination of a-z, digits,
and dashes that ends at the now escaped @

(?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n
then just make sure that what follows the @ is a-z, digits, and dashes, each
part ending in a . with a 2-6 letter TLD ending the hostname, followed by an
optional > and then an optional \r and a \n to end the line

(?:.+\r?\n)*?
then ignore following lines which aren't blank until we reach a line
starting with from:

from:.*?
line starts with from: followed by any characters

\@(?:[a-z\d\-]+\.)*?
find @valid.sub. part of from address

\g{TOFirstMatch}
use the \g{} syntax to match the named backreference

\.[a-z]{2,6}\>?\r?\n
immediately followed by .tld 2-6 characters in length, an optional > and an
optional \r then a \n
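
A side note on why the named reference seems to be the safer choice: Thomas's note quoted below says that when ASSP needs to capture the match, it runs the expression inside an extra pair of parentheses, /($testReRE)/. That makes the outer group number 1 and pushes (searchstring) to number 2. Here is a small Python sketch of that renumbering, assuming the wrapping works the way he describes. The wrap() helper is hypothetical, and Python writes the named reference as (?P=word) where PCRE uses \g{word}.

```python
import re

text = "searchstring@searchstring"

numbered = r"(searchstring).*\@.*\1"
named = r"(?P<word>searchstring).*\@.*(?P=word)"


def wrap(pattern: str) -> str:
    """Hypothetical stand-in for an outer capture group added at runtime."""
    return "(" + pattern + ")"


# On its own the numbered form is fine: (searchstring) is group 1.
standalone_ok = re.search(numbered, text) is not None

# Wrapped, \1 now points at the outer group, which is still open at that point.
try:
    re.compile(wrap(numbered))
    wrapped_numbered_error = None
except re.error as exc:
    wrapped_numbered_error = str(exc)

# The named reference does not care how many groups were added around it.
wrapped_named = re.search(wrap(named), text)

print("standalone numbered matches:", standalone_ok)
print("wrapped numbered error:", wrapped_numbered_error)
print("wrapped named captured:", wrapped_named.group("word"))
```

Python rejects the wrapped \1 when compiling. Perl may react differently in detail, but the shift in group numbers is the same idea, and it fits Thomas's advice to use named capture groups or \g.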

|
then an OR

and we do the whole thing again but with From First
from:.*?\@(?:[a-z\d\-]+\.)*?(?<FROMFirstMatch>[a-z\d\-]+)\.[a-z]{2,6}\>?\r?\n(?:.+\r?\n)*?to:(?:.*?[\s\<])*?\g{FROMFirstMatch}\@(?:[a-z\d\-]+\.)+[a-z]{2,6}\>?\r?\n

from:.*?
from: followed by anything until we hit

\@(?:[a-z\d\-]+\.)*?
an @ sign followed by any number of hostname labels, each followed by a .

(?<FROMFirstMatch>[a-z\d\-]+)
find the second level domain name and call it FROMFirstMatch

\.[a-z]{2,6}\>?\r?\n
followed by a .tld of 2 to 6 characters, an optional closing > and an
optional \r then a \n

(?:.+\r?\n)*?
move past non blank lines until we hit

to:(?:.*?[\s\<])*?
to: optionally followed by whatever characters ending in space or <

\g{FROMFirstMatch}\@
now look for the second level domain match from the from: line immediately
followed by an @ sign

(?:[a-z\d\-]+\.)+
then hostnames separated by dots, at least 1

[a-z]{2,6}\>?\r?\n
followed by a 2-6 character tld, an optional > and an optional \r then a \n

)
closing out the or between the TOFirstMatch and FROMFirstMatch sections.
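
One thing worth checking in the "skip non-blank lines" piece, (?:.+\r?\n)*? : with CRLF line endings, a blank line is \r\n, and the dot is happy to match the \r. So that piece can step over a blank line after all. The Python sketch below shows the effect with two cut-down patterns of my own (not the full regex), plus a stricter variant using [^\r\n]+ instead of .+ . Whether this matters in ASSP depends on what text the header regex is actually run against, which I have not verified.

```python
import re

# Cut-down version of the skipping logic used above.
SKIP_ORIGINAL = re.compile(r"to:.*\r?\n(?:.+\r?\n)*?from:", re.IGNORECASE)
# Stricter variant: a skipped line must contain at least one non-line-break character.
SKIP_STRICT = re.compile(r"to:[^\r\n]*\r?\n(?:[^\r\n]+\r?\n)*?from:", re.IGNORECASE)

# Made-up test data: the "from:" line sits after a blank line, i.e. in the body.
crlf_text = "to: <acme@ourcharity.org>\r\n\r\nfrom: this line is in the body\r\n"
lf_text = "to: <acme@ourcharity.org>\n\nfrom: this line is in the body\n"
# Ordinary header block with no blank line in between.
normal_text = "to: <acme@ourcharity.org>\r\nsubject: hi\r\nfrom: <jane@acme.org>\r\n"

for name, pattern in (("original", SKIP_ORIGINAL), ("strict", SKIP_STRICT)):
    print(
        name,
        "crlf:", pattern.search(crlf_text) is not None,
        "lf:", pattern.search(lf_text) is not None,
        "normal:", pattern.search(normal_text) is not None,
    )
```

With bare \n line endings both variants stop at the blank line. With \r\n only the stricter one does.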

Whew.

On Thu, Nov 4, 2021 at 4:53 AM Thomas Eckardt <thomas.ecka...@thockar.com>
wrote:

> forgot to say:
>
> if assp requires to capture the match for a regex, the code would be for
> example
>
> $string =~ /($testReRE)/
> $match = $1;
>
> so - at runtime the regex is
>
> ((?^u:(?is:(?:^|\n\r).*(searchstring).*@.*\1.*)))
>
> IMHO you need to use named capture groups or \g or (?|
>
> Thomas
>
>
>
> From:        "Thomas Eckardt" <thomas.ecka...@thockar.com>
> To:        "ASSP development mailing list" <
> assp-test@lists.sourceforge.net>
> Date:        04.11.2021 09:22
> Subject:        Re: [Assp-test] RegEx Backreferences - the basics
> ------------------------------
>
>
>
> to make backreferences work, regex optimization must be switched off
> for the complete regex -> tested -> worked
>
> >I've seen posts here indicating that backreferencing matches is possible
> with an unoptimized expression.
>
> so - the problem is sitting in front of the monitor :):)
>
> m/(?is:(?:^|\n\r).*(?:searchstring).*@.*\1 <-- HERE .*)/
>
> optimized - default is : 'no extra group capturing is allowed'
>
> >I've got to be missing something incredibly obvious.
>
> assp-do-not-optimize-regex
>
> >  (?:^|\n\r).*(searchstring).*@.*\1.*
>
> assp makes it:
>
> (?is:(?:^|\n\r).*(searchstring).*@.*\1.*)
>
> think about your regex - read it from left to right as 'perl regex engine'
> - what will happen?
> besides the other mistakes, the @ should be escaped as \@, because an
> ARRAY @. may exist
>
> >Regex101.com seems to confirm that this works.
>
> does not check perl pcre
>
> and if I read the explanation there, I'm sure it will not work like you
> expect
>
>
> Thomas
>
>
>
> From:        "K Post" <nntp.p...@gmail.com>
> To:        "ASSP development mailing list" <
> assp-test@lists.sourceforge.net>
> Date:        04.11.2021 02:29
> Subject:        [Assp-test] RegEx Backreferences - the basics
> ------------------------------
>
>
>
> I've got nothing in my TestRe file except for a single line:
>
> ~<<<(?:^|\n\r).*(searchstring).*@.*\1.*>>>~
>
> The idea is to log any time there's a line that includes "searchstring" on
> the right and left of an @.  This is just a very rudimentary test because
> backreferences seem to error for me.  I would expect this to match
> searchstring@searchstring
> something else searchstring more @ whatever searchstring bla
> If "searchstring" is to the right and left of an @ sign, it should match.
> Regex101.com seems to confirm that this works.  Like I said, super basic.
>
> However, if I enter ~<<<(?:^|\n\r).*(searchstring).*@.*\1.*>>>~ as the
> only line in TestRe file, I get a warning in the log:
>
> - Reference to nonexistent group in regex; marked by <-- HERE in
> m/(?is:(?:^|\n\r).*(?:searchstring).*@.*\1 <-- HERE .*)/
> - try using unoptimized regex
>
> To my understanding, the <<< >>> surround should turn off regex
> optimization for that line, which enables backreferencing (\1) to work and
> the ~ is required because there's an or in there.   Shouldn't the \1
> reference (searchstring) ?  I don't understand why assp thinks that \1 is a
> reference to a non-existent group.
>
> I also tried removing the <<< >>> and adding assp-do-not-optimize to the
> top of the TestRe file.  No difference.    No matter how simple I make the
> regex, even (.*)@\1,  it still complains about the invalid backreference.
>
>
> I've got to be missing something incredibly obvious.  I've read through
> the regex doc in docs, but that doesn't talk about backreferencing in ASSP
> and I can't find anything in the GUI that makes mention. I've seen posts
> here indicating that backreferencing matches is possible with an
> unoptimized expression.
>
> A shove in the right direction would be greatly appreciated.
> _______________________________________________
> Assp-test mailing list
> Assp-test@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/assp-test
>
>
>
>
> DISCLAIMER:
> *******************************************************
> This email and any files transmitted with it may be confidential, legally
> privileged and protected in law and are intended solely for the use of the
> individual to whom it is addressed.
> This email was multiple times scanned for viruses. There should be no
> known virus in this email!
> *******************************************************
