Re: PDFinfo not returning expected producer, creator values

2022-03-04 Thread Kevin A. McGrail
I also want to echo Bill's comment: this was a very detailed bug report.

On Fri, Mar 4, 2022, 18:05 Ricky Boone  wrote:

> Sorry for the late reply, crazy week.
>
> Honestly, I wasn't expecting such a quick and relevant response, so thanks
> and kudos for that.  :)
>
> I'm not currently using trunk, so I will try to patch in the changes
> described during a quiet period over the weekend.  It does look like that
> should do the trick, though.
>
> On Thu, Mar 3, 2022 at 1:48 AM Bill Cole <
> sausers-20150...@billmail.scconsult.com> wrote:
>
>> On 2022-03-02 at 17:58:50 UTC-0500 (Wed, 2 Mar 2022 17:58:50 -0500)
>> Ricky Boone 
>> is rumored to have said:
>>
>> > If this is the wrong forum to report this, let me know.
>>
>> This is fine. I've also documented the fix in our Bugzilla at
>> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7960
>>
>> If you're running the 'trunk' version out of svn, the fix is in there.
>> We do not even have a target date for the next release, but we generally
>> do not break 'trunk' if you're feeling adventurous.
>>
>> If you're a different sort of adventurous, willing to hack on your local
>> copy of the code, the fix is to remove these lines (~223-224) which skip
>> lines based on an antique assumption:
>>
>> -  # lines containing high bytes will have no data we need, so save some cycles
>> -  next if ($line =~ /[\x80-\xff]/);
>>
>> Thank you very much for the detailed analysis. I had seen this problem
>> on some PDFs but have not had the time to dive into the issue. You
>> vastly reduced the pain of fixing it.
>>
>>
>> > I'm trying to create a couple rules to identify questionable PDFs
>> > (phishing, etc.).  While evaluating the debug output from spamassassin
>> > for the pdfinfo plugin, I noticed that some of the test file attributes
>> > aren't being populated correctly, when comparing against exiftool, Adobe
>> > Reader, Firefox, etc.  The producer and creator fields, specifically,
>> > appear to be left as unknown.
>> >
>> > Compared against other emails and PDFs, I get similar results, so I
>> > suspect it's an issue with the plugin or how it is parsing the PDF.  I do
>> > have this example available, however it is malicious (it links to a
>> > phishing site), so I wouldn't want to link to it directly in this thread.
>> >
>> > For example:
>> >
>> > $ less Invoice0098539.pdf
>> > %PDF-1.4
>> > 1 0 obj
>> > <<
>> > /Title ()
>> > /Creator (^@w^@k^@h^@t^@m^@l^@t^@o^@p^@d^@f^@ ^@0^@.^@1^@2^@.^@5)
>> > /Producer (^@Q^@t^@ ^@4^@.^@8^@.^@7)
>>
>> There's the cause. Apparently the use of UTF-16BE encoding with a
>> leading BOM for metadata was not so common when that plugin was written.
>> It saw the BOM and assumed the line was binary data.
>>
>>
>> --
>> Bill Cole
>> b...@scconsult.com or billc...@apache.org
>> (AKA @grumpybozo and many *@billmail.scconsult.com addresses)
>> Not Currently Available For Hire
>>
>


Re: PDFinfo not returning expected producer, creator values

2022-03-04 Thread Ricky Boone
Sorry for the late reply, crazy week.

Honestly, I wasn't expecting such a quick and relevant response, so thanks
and kudos for that.  :)

I'm not currently using trunk, so I will try to patch in the changes
described during a quiet period over the weekend.  It does look like that
should do the trick, though.

On Thu, Mar 3, 2022 at 1:48 AM Bill Cole <
sausers-20150...@billmail.scconsult.com> wrote:

> On 2022-03-02 at 17:58:50 UTC-0500 (Wed, 2 Mar 2022 17:58:50 -0500)
> Ricky Boone 
> is rumored to have said:
>
> > If this is the wrong forum to report this, let me know.
>
> This is fine. I've also documented the fix in our Bugzilla at
> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7960
>
> If you're running the 'trunk' version out of svn, the fix is in there.
> We do not even have a target date for the next release, but we generally
> do not break 'trunk' if you're feeling adventurous.
>
> If you're a different sort of adventurous, willing to hack on your local
> copy of the code, the fix is to remove these lines (~223-224) which skip
> lines based on an antique assumption:
>
> -  # lines containing high bytes will have no data we need, so save some cycles
> -  next if ($line =~ /[\x80-\xff]/);
>
> Thank you very much for the detailed analysis. I had seen this problem
> on some PDFs but have not had the time to dive into the issue. You
> vastly reduced the pain of fixing it.
>
>
> > I'm trying to create a couple rules to identify questionable PDFs
> > (phishing, etc.).  While evaluating the debug output from spamassassin
> > for the pdfinfo plugin, I noticed that some of the test file attributes
> > aren't being populated correctly, when comparing against exiftool, Adobe
> > Reader, Firefox, etc.  The producer and creator fields, specifically,
> > appear to be left as unknown.
> >
> > Compared against other emails and PDFs, I get similar results, so I
> > suspect it's an issue with the plugin or how it is parsing the PDF.  I do
> > have this example available, however it is malicious (it links to a
> > phishing site), so I wouldn't want to link to it directly in this thread.
> >
> > For example:
> >
> > $ less Invoice0098539.pdf
> > %PDF-1.4
> > 1 0 obj
> > <<
> > /Title ()
> > /Creator (^@w^@k^@h^@t^@m^@l^@t^@o^@p^@d^@f^@ ^@0^@.^@1^@2^@.^@5)
> > /Producer (^@Q^@t^@ ^@4^@.^@8^@.^@7)
>
> There's the cause. Apparently the use of UTF-16BE encoding with a
> leading BOM for metadata was not so common when that plugin was written.
> It saw the BOM and assumed the line was binary data.
>
>
> --
> Bill Cole
> b...@scconsult.com or billc...@apache.org
> (AKA @grumpybozo and many *@billmail.scconsult.com addresses)
> Not Currently Available For Hire
>
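
A minimal sketch of the cause Bill describes, written in Python rather than the plugin's Perl and not taken from the plugin itself. It assumes the metadata string starts with the UTF-16BE BOM (bytes FE FF) as Bill says; the sample bytes are reconstructed from the /Producer value shown above, and the helper name decode_pdf_text is hypothetical. Non-BOM strings are treated as Latin-1 here, which is a simplification of PDFDocEncoding, and escape sequences inside PDF strings are ignored.

```python
import re

# The removed check, translated to Python: any byte in 0x80-0xff
# caused the whole line to be skipped.
HIGH_BYTE = re.compile(rb"[\x80-\xff]")


def decode_pdf_text(raw: bytes) -> str:
    """Decode a PDF text string (hypothetical helper, simplified)."""
    if raw.startswith(b"\xfe\xff"):
        # UTF-16BE with a leading byte order mark
        return raw[2:].decode("utf-16-be")
    # Simplification: treat everything else as Latin-1
    return raw.decode("latin-1")


# Reconstructed sample, modelled on the /Producer line in the report
producer = b"\xfe\xff" + "Qt 4.8.7".encode("utf-16-be")
line = b"/Producer (" + producer + b")"

# The BOM bytes are "high bytes", so the old check would skip this line
skipped_by_old_check = HIGH_BYTE.search(line) is not None
print(skipped_by_old_check)
print(decode_pdf_text(producer))
```

Both BOM bytes fall in the 0x80-0xff range, which is why a line carrying perfectly readable metadata was discarded before the fix.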


Re: how sendgrid is abusing the ukraine crisis (or they are still to dumb to filter for spam)

2022-03-04 Thread Greg Troxel

Bill Cole  writes:

> On 2022-03-04 at 09:18:08 UTC-0500 (Fri, 04 Mar 2022 09:18:08 -0500)
> Greg Troxel 
> is rumored to have said:
>
>> Greg Troxel  writes:
>>
>>> With stock scores, sendgrid gets
>>>
>>>  2.1 URIBL_GREY Contains an URL listed in the URIBL greylist
>>> [URIs: sendgrid.net]
>>>  1.5 KAM_SENDGRID   Sendgrid being exploited by scammers
>>>
>>> and I find 3.6 a bit much.

(sorry, URIBL_GREY is only 1.1, so that's 2.6 between them)

> Note that those are quasi-independent rules. URIBL looks at all of the
> URIs in a message. KAM_SENDGRID only hits mail transferred through
> Sendgrid where the From header and envelope sender addresses are in
> unrelated domains.

I meant only that I find that for this particular sender, both rules
hit.

> I may be wrong, but I do not believe that all Sendgrid ham will hit
> either of those rules, although much surely will hit both. The KAM
> rules don't go through QA that would reveal their overlap/independence
> as the stock rules do, so there's not a good way that I can check.

I am unclear on whether KAM_SENDGRID is supposed to hit on legit mail from
sendgrid; it does for this particular class of ham.  It sounds like you
think at least some sendgrid ham will hit this.

Return-Path: seems like it matches __KAM_SENDGRID1A, Received looks like
it matches __KAM_SENDGRID2, and the From: is from the government
office's domain.

>>> But maybe 72% of what sendgrid sends is
>>> spam?  (Knowing the spam % is actually a serious question.)
>>
>> sorry, didn't quite get back to stock for that  test, so I think it's
>> only 1.1+1.5=2.6, so tuned for 52% spam...
>
> FWIW, that is NOT how the math works for score determination. Even for
> the stock rules which get programmatically adjusted as a set, that's
> not a "tuning" target that would be useful or even have a calculable
> solution.

Sorry, I do know that, but what I was trying to get at, and did so
badly, was that if a rule has a score of 2.5, then I would expect that a
fairly large share of the messages that trigger it would be spam.
Otherwise, I would expect that score to be reduced by the tuning
algorithms.

> The rule score tuning doesn't really pay any attention to aggregate
> score values except in >/< relation to the threshold. If 100% of a
> sender's mail is ham that just happens to score 4.2, that's great. If
> it is 100% spam, all scoring 5.2, that's also great. If it is a 50/50
> mix that SA scores perfectly at either 4.2 or 5.2, that would be
> astoundingly good. Message scores do NOT have a score distribution
> that can be approximated by any combination of statistically useful
> distributions which could support the sort of score arithmetic you are
> positing.

I see your point but it would be interesting to see the %spam data (out
of some background ham/spam a priori rate) per rule, somehow in a
scatter plot with score.
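
A rough sketch of how that per-rule %spam figure could be computed from a hand-labelled corpus. This is Python for illustration only; the corpus below is an invented toy example, not real mail data, and the function name is hypothetical.

```python
from collections import Counter

# Invented toy corpus: (is_spam, set of rules that hit). Not real data.
corpus = [
    (True, {"URIBL_GREY", "KAM_SENDGRID"}),
    (True, {"KAM_SENDGRID"}),
    (False, {"URIBL_GREY", "KAM_SENDGRID"}),
    (False, {"URIBL_GREY"}),
    (True, {"URIBL_GREY"}),
]


def spam_fraction_per_rule(messages):
    """Return {rule: fraction of the messages hitting it that are spam}."""
    hits, spam_hits = Counter(), Counter()
    for is_spam, rules in messages:
        for rule in rules:
            hits[rule] += 1
            spam_hits[rule] += int(is_spam)
    return {rule: spam_hits[rule] / hits[rule] for rule in hits}


fractions = spam_fraction_per_rule(corpus)
for rule, frac in sorted(fractions.items()):
    print(f"{rule}: {frac:.0%} of hits were spam")
```

Pairing each fraction with the rule's score would give the points for the scatter plot; the result depends entirely on the ham/spam mix of the corpus, which is the a priori rate mentioned above.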

Also given how things are, if ham scored 4.2 it would take very little
in terms of a 1-point rule or 2 x .5 rules triggering vs not to push it
over.  So while 4.2 is a good score for ham in the metrics, it's not in
my view a good score for a ham message viewed over the ensemble of other
things that are likely to happen.

All I'm really trying to say is that ham getting 2.5 from one rule moves
it halfway to threshold, where it gets marked as spam if the rest of the
rules give it >=2.5.
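
As a worked version of that arithmetic (a sketch in Python, not anything SpamAssassin runs): the two rule scores are the corrected ones from this thread, 5.0 is SpamAssassin's default required_score, and the 2.4 points from "other rules" is an invented figure.

```python
THRESHOLD = 5.0  # SpamAssassin's default required_score


def total_and_verdict(hits, threshold=THRESHOLD):
    """Sum the rule scores and compare against the threshold."""
    total = round(sum(hits.values()), 3)
    return total, total >= threshold


# Corrected scores from this thread
sendgrid_only = {"URIBL_GREY": 1.1, "KAM_SENDGRID": 1.5}
base_total, base_flagged = total_and_verdict(sendgrid_only)

# Hypothetical: the remaining rules add another 2.4 points
with_others = dict(sendgrid_only, OTHER_RULES=2.4)
full_total, full_flagged = total_and_verdict(with_others)

print(base_total, base_flagged)
print(full_total, full_flagged)
```

With 2.6 points already spent on the sender, a ham message has only 2.4 points of headroom before it is marked as spam.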

> I wish Justin had originally made the base score -5 and the threshold
> 0. It's 20 years too late to fix that, but it would have made it
> easier for people to avoid wrong mathematical assumptions about the
> value of the aggregate score of a message.

I do know how scores are determined for the base ruleset (and above you
said that the KAM scores aren't determined that way, I think).

And I know it's against doctrine, but I find that the odds of spam
change from near 0 at -2 to near 1 at >=4.  Just above about 2, it's
roughly 50%, and it's not linear.  Because of that I treat 3 differently
from <1, putting 3 in a maybe-spam folder not allowed to show up on my
phone.  I know that's not how SA's "was this message scored
correctly" is defined, but I find this sort of sorting very useful.

The message in question did actually get to 5.0.  I've tweaked scores,
up and down, so I know that doesn't technically count.




Re: how sendgrid is abusing the ukraine crisis (or they are still to dumb to filter for spam)

2022-03-04 Thread Bill Cole
On 2022-03-04 at 09:18:08 UTC-0500 (Fri, 04 Mar 2022 09:18:08 -0500)
Greg Troxel 
is rumored to have said:

> Greg Troxel  writes:
>
>> With stock scores, sendgrid gets
>>
>>  2.1 URIBL_GREY Contains an URL listed in the URIBL greylist
>> [URIs: sendgrid.net]
>>  1.5 KAM_SENDGRID   Sendgrid being exploited by scammers
>>
>> and I find 3.6 a bit much.


Note that those are quasi-independent rules. URIBL looks at all of the URIs in 
a message. KAM_SENDGRID only hits mail transferred through Sendgrid where the 
From header and envelope sender addresses are in unrelated domains.

I may be wrong, but I do not believe that all Sendgrid ham will hit either of 
those rules, although much surely will hit both. The KAM rules don't go through 
QA that would reveal their overlap/independence as the stock rules do, so 
there's not a good way that I can check.

>> But maybe 72% of what sendgrid sends is
>> spam?  (Knowing the spam % is actually a serious question.)
>
> sorry, didn't quite get back to stock for that  test, so I think it's
> only 1.1+1.5=2.6, so tuned for 52% spam...

FWIW, that is NOT how the math works for score determination. Even for the 
stock rules which get programmatically adjusted as a set, that's not a "tuning" 
target that would be useful or even have a calculable solution.

The rule score tuning doesn't really pay any attention to aggregate score 
values except in >/< relation to the threshold. If 100% of a sender's mail is 
ham that just happens to score 4.2, that's great. If it is 100% spam, all 
scoring 5.2, that's also great. If it is a 50/50 mix that SA scores perfectly 
at either 4.2 or 5.2, that would be astoundingly good. Message scores do NOT 
have a score distribution that can be approximated by any combination of 
statistically useful distributions which could support the sort of score 
arithmetic you are positing.

I wish Justin had originally made the base score -5 and the threshold 0. It's 
20 years too late to fix that, but it would have made it easier for people to 
avoid wrong mathematical assumptions about the value of the aggregate score of 
a message.


-- 
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire




address in from name, FromNameSpoof

2022-03-04 Thread Matus UHLAR - fantomas

Hello,

I got reports of multiple spam messages of the form:

From: " martin.redact...@example.com"

To: "ředácted xyz, Ing." 
Subject: Fw:  xyz.redact...@example.com

(I intentionally kept some characters with diacritics because that is similar to
what the unredacted addresses looked like)


I was trying to catch these with FromNameSpoof plugin:

header  L_FROMNAME_EMAIL            eval:check_fromname_contains_email()
header  L_FROMNAME_DIFFERENT        eval:check_fromname_different()
header  L_FROMNAME_OWNERS_DIFFER    eval:check_fromname_owners_differ()
header  L_FROMNAME_DOMAIN_DIFFER    eval:check_fromname_domain_differ()
header  L_FROMNAME_SPOOF            eval:check_fromname_spoof()
header  L_FROMNAME_EQUALS_TO        eval:check_fromname_equals_to()
header  L_FROMNAME_EQUALS_REPLYTO   eval:check_fromname_equals_replyto()

None of those rules hit, nor did any of the _FNSFNAME*_ tags.
Am I expecting too much from FromNameSpoof?
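
For reference, the kind of check being asked for can be sketched outside SpamAssassin. The Python below is only an illustration of the idea (an email address in the From display name that differs from the real sender address); it is not the FromNameSpoof plugin's logic, and the function name, the regex, and the sample addresses are all made up.

```python
import re
from email.utils import parseaddr

# Deliberately loose pattern for "something that looks like an address"
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def fromname_email_differs(from_header: str):
    """Return True/False if the display name contains an address that
    differs from / matches the real one, or None if it contains none."""
    name, addr = parseaddr(from_header)
    match = EMAIL_RE.search(name)
    if match is None:
        return None
    return match.group(0).lower() != addr.lower()


# Made-up headers in the shape of the reported spam
spoofed = '"martin@example.com" <someone@example.net>'
honest = '"martin@example.com" <martin@example.com>'
plain = '"Martin Example" <martin@example.com>'

print(fromname_email_differs(spoofed))
print(fromname_email_differs(honest))
print(fromname_email_differs(plain))
```

A real implementation would also need to handle MIME-encoded display names and leading whitespace inside the quotes, as in the sample header above.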


--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
2B|!2B, that's a question!


Re: how sendgrid is abusing the ukraine crisis (or they are still to dumb to filter for spam)

2022-03-04 Thread Alan
FWIW at least I've found them to be responsive to abuse reports, unlike 
Amazon SES.


On 2022-03-04 08:01, Marc wrote:

Is anyone already blocking connections from outbound-mail.sendgrid.net? Does 
that generate a lot of false positives?
PS. just posting this so it is on web archives and people searching for 
sendgrid hopefully choose a better service.


--
For SpamAssassin Users List



Re: how sendgrid is abusing the ukraine crisis (or they are still to dumb to filter for spam)

2022-03-04 Thread Alan Hodgson
On Fri, 2022-03-04 at 13:01 +, Marc wrote:
> Is anyone already blocking connections from
> outbound-mail.sendgrid.net? Does that generate a lot of false positives?
> PS. just posting this so it is on web archives and people searching
> for sendgrid hopefully choose a better service.
> 

Unfortunately, a lot of legitimate senders still use Sendgrid.


Re: how sendgrid is abusing the ukraine crisis (or they are still to dumb to filter for spam)

2022-03-04 Thread Greg Troxel

Greg Troxel  writes:

> With stock scores, sendgrid gets
>
>  2.1 URIBL_GREY Contains an URL listed in the URIBL greylist
> [URIs: sendgrid.net]
>  1.5 KAM_SENDGRID   Sendgrid being exploited by scammers
>
> and I find 3.6 a bit much.  But maybe 72% of what sendgrid sends is
> spam?  (Knowing the spam % is actually a serious question.)

sorry, didn't quite get back to stock for that  test, so I think it's
only 1.1+1.5=2.6, so tuned for 52% spam...




Re: how sendgrid is abusing the ukraine crisis (or they are still to dumb to filter for spam)

2022-03-04 Thread Greg Troxel

CC: trimmed as my message is not an abuse report.

You asked about outright blocking, but you didn't ask if people thought
that was wise.

I received a piece of ham today, and the received line added by my MTA is:

  Received: from o1678989x80.outbound-mail.sendgrid.net 
(o1678989x80.outbound-mail.sendgrid.net [167.89.89.80])

This was a legitimate message from an agency of a local government, and
solidly ham.

I'm not going to claim that sendgrid is or isn't ok -- I don't
personally have any data.  But it's clear that at least one legitimate
entity uses them and that I receive some ham from them.

With stock scores, sendgrid gets

 2.1 URIBL_GREY Contains an URL listed in the URIBL greylist
[URIs: sendgrid.net]
 1.5 KAM_SENDGRID   Sendgrid being exploited by scammers

and I find 3.6 a bit much.  But maybe 72% of what sendgrid sends is
spam?  (Knowing the spam % is actually a serious question.)

I find that ham misfiled as spam just because of sendgrid is fairly rare,
and I just welcomelist them.  So that's probably a clue that I get little
ham from sendgrid.

But an outright block doesn't seem like a good idea.  It certainly would
result in me losing ham.





Re: how sendgrid is abusing the ukraine crisis (or they are still to dumb to filter for spam)

2022-03-04 Thread Benny Pedersen

On 2022-03-04 14:01, Marc wrote:

Is anyone already blocking connections from
outbound-mail.sendgrid.net? Does that generate a lot of false
positives?
PS. just posting this so it is on web archives and people searching
for sendgrid hopefully choose a better service.



first define better service