Re: ExtractText and docx

2021-05-06 Thread Peter West
If you have a JVM lying around, you can extract docx text with Apache Tika.


—
Peter West
p...@ehealth.id.au
“I am the vine; you are the branches.”

> On 7 May 2021, at 2:30 pm, John Hardin  wrote:
> 
> On Thu, 6 May 2021, Alex wrote:
> 
>> Hi,
>> 
>> I'm trying to use the latest ExtractText plugin, but the docx2txt
>> program the plugin references is no longer available from
>> http://docx2txt.sourceforge.net
> 
>> Do you have any recommendations for an alternative...?
> 
> Perhaps one of (from Stack Overflow):
> 
> unzip -p some.docx word/document.xml |\
>   sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
> 
> unzip -p document.docx word/document.xml |\
>   sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'
> 
> unzip -p document.docx word/document.xml |\
>   sed -e 's/<\/w:p>/ /g; s/<[^>]\{1,\}>/ /g; s/[^[:print:]]\{1,\}/ /g'
> 
> ...though html2text might be better than sed for reliably de-XMLizing the 
> document text.
> 
> There's also this:
> 
>  http://abisource.com/downloads/wv/
> 
> There's conflicting information on whether Antiword groks .docx; you may want 
> to try it and see. It may be available from your distro, otherwise:
> 
>  http://www.winfield.demon.nl/index.html
> 
> It might be worthwhile to use native perl utilities to unzip the file, 
> extract the document.xml content and pass it through XML::XPath to extract 
> the text, but that would probably involve code changes to ExtractText rather 
> than just configuring it to use an external utility.
> 
> Caveat: I have never looked at the ExtractText plugin.
> 
> 
> -- 
> John Hardin KA7OHZ http://www.impsec.org/~jhardin/
> jhar...@impsec.org pgpk -a jhar...@impsec.org
> key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
> ---
>  Are you a mildly tech-literate politico horrified by the level of
>  ignorance demonstrated by lawmakers gearing up to regulate online
>  technology they don't even begin to grasp? Cool. Now you have a
>  tiny glimpse into a day in the life of a gun owner.   -- Sean Davis
> ---
> 2 days until the 76th anniversary of VE day
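
For anyone who prefers to avoid both a JVM and sed, the unzip-and-strip idea quoted above can be sketched with a real XML parser. This is only an illustration in Python, not anything ExtractText ships: the function name docx_to_text is made up for the example, and it assumes the body text lives in word/document.xml (the same member the unzip commands read). It ignores headers, footers and footnotes.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the body text of a .docx file, one line per paragraph.

    `source` may be a filesystem path or a binary file-like object.
    (Hypothetical helper, for illustration only.)
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each <w:t> element carries one run of text within the paragraph.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Something like print(docx_to_text("some.docx")) would then stand in for the unzip | sed pipeline; unlike the regex approach, XML entities such as &amp; are decoded by the parser.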



Re: OT: Re: Unsubscribe link at the bottom.

2021-04-05 Thread Peter West
Yes. I meant the unsubscribe link from an unknown advertiser.
—
Peter West
p...@ehealth.id.au
“He has risen…”

> On 6 Apr 2021, at 12:50 pm, Grant Taylor  wrote:
> 
> On 4/5/21 8:41 PM, Peter West wrote:
>> I’d agree it’s address verification, as with the Unsubscribe link at the 
>> bottom.
> 
> I'm of the opinion that if I have any inkling of knowledge of the company 
> sending the email, and SPF/DKIM/DMARC pass, I'll probably use the unsubscribe 
> link.
> 
> Recently I ran into a 404 from the unsubscribe link from a company that my 
> wife did business with.  *facepalm*
> 
> 
> 
> -- 
> Grant. . . .
> unix || die
> 



Re: "Please send us a quote..."?

2021-04-05 Thread Peter West
I’d agree it’s address verification, as with the Unsubscribe link at the bottom.

—
Peter West
p...@ehealth.id.au
“He has risen…”

> On 6 Apr 2021, at 12:30 pm, Grant Taylor  wrote:
> 
> On 4/5/21 7:30 PM, John Hardin wrote:
>> Can anybody explain to me the reason behind the blind "please send us a 
>> quote for your product X" emails? I mean, I know they are somehow a 
>> scam, but I can't figure out how it's supposed to work when the target 
>> isn't a business...
> 
> I chalk this up to list washing or similar address verification.
> 
> 
> 
> -- 
> Grant. . . .
> unix || die
> 



Re: Problem with local.cf rules

2021-03-16 Thread Peter West
The most pertinent stuff I found was this Confluence page:
https://cwiki.apache.org/confluence/display/SPAMASSASSIN/CachingNameserver

So it looks as though I have to install a primary nameserver and a secondary 
rbldnsd.

I’m trying to translate this –
Rsync the feed files into /var/lib/rbldnsd

which seems to be this set 
dul.dnsbl.sorbs.net:ip4set:dul.dnsbl.sorbs.net
http.dnsbl.sorbs.net:dnset:http.dnsbl.sorbs.net
smtp.dnsbl.sorbs.net:ip4set:smtp.dnsbl.sorbs.net
new.spam.dnsbl.sorbs.net:ip4set:new.spam.dnsbl.sorbs.net
dnsbl-1.uceprotect.net:ip4set:dnsbl-1.uceprotect.net 

which is also dropped (for pdns-recursor) in forward-zones, like so
dul.dnsbl.sorbs.net=127.0.0.1:530
http.dnsbl.sorbs.net=127.0.0.1:530
smtp.dnsbl.sorbs.net=127.0.0.1:530
new.spam.dnsbl.sorbs.net=127.0.0.1:530
dnsbl-1.uceprotect.net=127.0.0.1:530

Apparently, an ip4set is a set of IPv4 addresses, while a dnset is a set of 
domain names.
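
As background on what an ip4set zone actually answers: by the usual DNSBL convention, a client tests an IPv4 address by reversing its octets and appending the list's zone name, then doing an ordinary A-record lookup on the result. A minimal Python sketch of how that query name is built follows; the function name dnsbl_query_name is made up for the example and no DNS lookup is performed.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Build the DNS name a client looks up to test an IPv4 address
    against a DNSBL zone (hypothetical helper, for illustration only)."""
    # IPv4Address() validates the input and raises ValueError on junk.
    address = ipaddress.IPv4Address(ip)
    reversed_octets = ".".join(reversed(str(address).split(".")))
    return f"{reversed_octets}.{zone}"
```

For example, dnsbl_query_name("192.0.2.99", "dul.dnsbl.sorbs.net") gives 99.2.0.192.dul.dnsbl.sorbs.net, which is the kind of name the forward-zones entries above would route to the local rbldnsd.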

I still don’t know how to translate –
Rsync the feed files into /var/lib/rbldnsd

And I don’t know whether I am supposed to rely only on sorbs + uceprotect, or 
whether I am supposed to somehow cobble similar sets together for Mailspike, 
SpamCop, Spamhaus ZEN, SURBL and URIBL (which circles me back to the original 
mail header notation which brought me here.)
See
https://cwiki.apache.org/confluence/display/spamassassin/DnsBlocklists#dnsbl-block

I am impressed by the level of obscurity, not to mention the sprawling vastness 
of spamassassin.

Further assistance is needed.

—
p...@ehealth.id.au
“…an hour is coming when all who are in the tombs will hear his voice and come 
out…”

> On 15 Mar 2021, at 1:29 am, John Hardin  wrote:
> 
> On Sun, 14 Mar 2021, jwmi...@gmail.com wrote:
> 
>> Peter West writes:
>> 
>> And You might want to fix the URIBL_BLOCKED issue.  Fixing the
>> URIBL_BLOCKED issue will do far more to fix your issues than adding
>> rules.
> 
> Seconded. The keywords here are "local, caching, *NON-FORWARDING* DNS server 
> for SpamAssassin".
> 
> If that isn't enough to set you on the right path, search the mailing list 
> archives for "URIBL-BLOCKED" or "URIBL DNS" for previous discussions of this 
> topic. If that history isn't enough, feel free to ask for assistance.
> 
> -- 
> John Hardin KA7OHZ http://www.impsec.org/~jhardin/
> jhar...@impsec.org pgpk -a jhar...@impsec.org
> key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
> ---
>  Failure to plan ahead on someone else's part does not constitute
>  an emergency on my part. -- David W. Barts in a.s.r
> ---
> Today: Daylight Saving Time begins in U.S. - Spring Forward



Re: Problem with local.cf rules

2021-03-14 Thread Peter West
Well, that was simple. Thank you. What’s the default value of a rule? Does it 
have one?
—
p...@ehealth.id.au
“Two men went up into the temple to pray, one a Pharisee and the other a tax 
collector.”

> On 14 Mar 2021, at 11:41 pm, Alex Woick  wrote:
> 
> Peter West schrieb am 14.03.2021 um 14:30:
>> header CASINO From =~ /\bcasino\b/i
>> score 100.0
>> 
>> ===
>> 
>> 
>> It’s hitting the CASINO rule, but no matter what value I assign to the 
>> casino rules - 5, 20, 100 - these messages always come through with a value 
>> of 4.1. It’s as though some other rule is resetting the score to 0 before 
>> proceeding.
> You need to tell the rule name with the score keyword, otherwise spamassassin 
> cannot know to which rule it should set the score.
> 
> score CASINO 100
> 
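
Since the mistake is easy to repeat across a long local.cf, a quick check for score lines that lack a rule name may be handy. The following is a rough Python sketch, not a SpamAssassin tool: the function names are made up for the example, and it assumes a score line should read "score RULE_NAME number(s)", as in the fix above.

```python
def is_number(token):
    """Return True if the token parses as a number."""
    try:
        float(token)
    except ValueError:
        return False
    return True


def find_nameless_scores(config_text):
    """Return (line_number, line) pairs for `score` lines with no rule name.

    Hypothetical helper: flags lines such as "score 100.0", where the word
    after `score` is a number rather than a rule name.
    """
    problems = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        # Drop trailing comments, then split on whitespace.
        tokens = line.split("#", 1)[0].split()
        if not tokens or tokens[0] != "score":
            continue
        if len(tokens) < 3 or is_number(tokens[1]):
            problems.append((number, line.strip()))
    return problems
```

Feeding it the contents of local.cf lists every line that needs the rule name added.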



Problem with local.cf rules

2021-03-14 Thread Peter West
I’m running spamassassin 3.4.2-0 in ubuntu 18.04.4.

Controlling process is
/usr/bin/perl -T -w /usr/sbin/spamd -d --pidfile=/var/run/spamd.pid 
--create-prefs --max-children 5 --helper-home-dir


My local.cf has local rules enabled, and contains, inter alia, these rules
=

header CASINO From =~ /\bcasino\b/i
score 100.0

header CASINOS From =~ /\bcasinos\b/i
score 100.0

header CASINO_DONOVAN From =~ /\bray donovan\b/i
score 100.0

header CASINO_OLIVIA From =~ /\bolivia.*cs\b/i
score 100.0

header BAD_WORDS_1 From =~ 
/\b(swimming|solarbank|bag|intelligent|napkin|stretcher)\b/i
score 6.0

header BAD_WORDS_2 From =~ 
/\b(smart|amazing|clavicle|slicer|indestructible|bamboo)\b/i
score 6.0

header BAD_WORDS_3 From =~ 
/\b(innovation|selfie|socks|healthreporters|thermovest)\b/i
score 6.0

header BAD_WORDS_4 From =~ 
/\bdrone\b|\bremover\b|\btrainer\b|\btactical\b|\bsmart watch\b/i
score 6.0

header BAD_WORDS_5 From =~ /(\blost\b.*[0-9]+.*lbs\b)/i
score 10.0

header BAD_WORDS_6 From =~ /\bdrone\b|\bprofessional\b|\bslim\b|\bmini\b/i

header AUSPOST_GOOD From =~ /auspost\.com\.au/
score -20.0

header AUSPOST_BAD From =~ /Australia Post/
score 20.0
===

The casino stuff is still getting through. Here’s the X-Spam-Status on a 
typical message.

X-Spam-Status: No, score=4.1 required=5.0 tests=CASINO,DKIM_SIGNED,DKIM_VALID,
DKIM_VALID_AU,HTML_MESSAGE,MAILING_LIST_MULTI,RAZOR2_CF_RANGE_51_100,
RAZOR2_CHECK,SPF_HELO_PASS,SPF_PASS,URIBL_BLOCKED shortcircuit=no
autolearn=no autolearn_force=no version=3.4.2

It’s hitting the CASINO rule, but no matter what value I assign to the casino 
rules - 5, 20, 100 - these messages always come through with a value of 4.1. 
It’s as though some other rule is resetting the score to 0 before proceeding.

My other query concerns the AUSPOST rules. What I want to do is eliminate mail 
that has a name of “AUSTRALIA POST” and does NOT have an address containing 
. Hence I’m trying the -20.0 +20.0 pair of rules. Is there a 
more direct way of achieving this? Will a pcre ’not followed by’ style of rule 
do the trick?

Is there a finer subdivision of the From header, into name and address, for 
example?
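
To pin down the intended AUSPOST logic (display name claims Australia Post, address is not at auspost.com.au), here is a small Python sketch. It is not SpamAssassin configuration, and the function name looks_like_fake_auspost is made up for the example. On the SpamAssassin side, the Mail::SpamAssassin::Conf documentation describes :name and :addr modifiers for header rules (e.g. From:name, From:addr), which is worth checking against the installed version.

```python
from email.utils import parseaddr


def looks_like_fake_auspost(from_header):
    """True when the display name claims Australia Post but the address
    is not at auspost.com.au (hypothetical helper, for illustration only)."""
    # parseaddr splits 'Name <user@host>' into ('Name', 'user@host').
    name, address = parseaddr(from_header)
    claims_auspost = "australia post" in name.lower()
    domain = address.rpartition("@")[2].lower()
    genuine_domain = (
        domain == "auspost.com.au" or domain.endswith(".auspost.com.au")
    )
    return claims_auspost and not genuine_domain
```

Splitting the header first avoids the -20/+20 balancing act: one test looks only at the name, the other only at the address, and the two are combined explicitly.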

peter

—
p...@ehealth.id.au
“Two men went up into the temple to pray, one a Pharisee and the other a tax 
collector.”