Re: ExtractText and docx
If you have a JVM lying around, you can extract docx text with Apache Tika.

--
Peter West
p...@ehealth.id.au
“I am the vine; you are the branches.”

> On 7 May 2021, at 2:30 pm, John Hardin wrote:
>
> On Thu, 6 May 2021, Alex wrote:
>
>> Hi,
>>
>> I'm trying to use the latest ExtractText plugin, but the docx2txt
>> program the plugin references is no longer available from
>> http://docx2txt.sourceforge.net
>>
>> Do you have any recommendations for an alternative...?
>
> Perhaps one of (from Stack Overflow):
>
>   unzip -p some.docx word/document.xml |\
>     sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
>
>   unzip -p document.docx word/document.xml |\
>     sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'
>
>   unzip -p document.docx word/document.xml |\
>     sed -e 's/<\/w:p>/ /g; s/<[^>]\{1,\}>/ /g; s/[^[:print:]]\{1,\}/ /g'
>
> ...though html2text might be better than sed for reliably de-XMLizing the
> document text.
>
> There's also this:
>
>   http://abisource.com/downloads/wv/
>
> There's conflicting information on whether Antiword groks .docx; you may want
> to try it and see. It may be available from your distro, otherwise:
>
>   http://www.winfield.demon.nl/index.html
>
> It might be worthwhile to use native perl utilities to unzip the file,
> extract the document.xml content and pass it through XML::XPath to extract
> the text, but that would probably involve code changes to ExtractText rather
> than just configuring it to use an external utility.
>
> Caveat: I have never looked at the ExtractText plugin.
>
> --
> John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
> jhar...@impsec.org    pgpk -a jhar...@impsec.org
> key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
> ---
> Are you a mildly tech-literate politico horrified by the level of
> ignorance demonstrated by lawmakers gearing up to regulate online
> technology they don't even begin to grasp? Cool. Now you have a
> tiny glimpse into a day in the life of a gun owner.
>                                                  -- Sean Davis
> ---
> 2 days until the 76th anniversary of VE day
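
The unzip-then-parse approach suggested in the quoted message (unzip the file, read word/document.xml, pull the text out with an XML parser rather than sed) can be sketched with a standard library alone. The sketch below uses Python instead of the Perl modules mentioned, purely as an illustration. The function names docx_text and make_sample_docx are hypothetical, and nothing here is part of the ExtractText plugin. It assumes the usual WordprocessingML layout, where paragraphs are w:p elements and visible text sits in w:t runs.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx file.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(source):
    """Return the text of a .docx, one output line per paragraph.

    `source` is a filesystem path or a binary file-like object.
    Hypothetical helper for illustration; not part of ExtractText.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # Each <w:p> is a paragraph; its visible text lives in <w:t> elements.
    for para in root.iter(f"{{{W_NS}}}p"):
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def make_sample_docx():
    """Build a tiny two-paragraph .docx-like archive in memory for the demo."""
    body = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", body)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(docx_text(make_sample_docx()))
```

Unlike the sed one-liners, a real XML parser decodes entities such as &amp; and keeps paragraph boundaries without regex guesswork. Headers, footers, footnotes and embedded objects live in other parts of the archive and are ignored by this sketch.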
Re: OT: Re: Unsubscribe link at the bottom.
Yes. I meant the unsubscribe link from an unknown advertiser.

--
Peter West
p...@ehealth.id.au
“He has risen…”

> On 6 Apr 2021, at 12:50 pm, Grant Taylor wrote:
>
> On 4/5/21 8:41 PM, Peter West wrote:
>> I’d agree it’s address verification, as with the Unsubscribe link at the
>> bottom.
>
> I'm of the opinion that if I have any inkling of knowledge of the company
> sending the email, and SPF/DKIM/DMARC pass, I'll probably use the unsubscribe
> link.
>
> Recently I ran into a 404 from the unsubscribe link from a company that my
> wife did business with. *facepalm*
>
> --
> Grant. . . .
> unix || die
Re: "Please send us a quote..."?
I’d agree it’s address verification, as with the Unsubscribe link at the bottom.

--
Peter West
p...@ehealth.id.au
“He has risen…”

> On 6 Apr 2021, at 12:30 pm, Grant Taylor wrote:
>
> On 4/5/21 7:30 PM, John Hardin wrote:
>> Can anybody explain to me the reason behind the blind "please send us a
>> quote for your product X" emails? I mean, I know they are somehow a
>> scam, but I can't figure out how it's supposed to work when the target
>> isn't a business...
>
> I chalk this up to list washing or similar address verification.
>
> --
> Grant. . . .
> unix || die
Re: Problem with local.cf rules
The most pertinent stuff I found was this Confluence page:

  https://cwiki.apache.org/confluence/display/SPAMASSASSIN/CachingNameserver

So it looks as though I have to install a primary nameserver and a secondary
rbldnsd. I’m trying to translate this:

  - Rsync the feed files into /var/lib/rbldnsd

which seems to be this set

  dul.dnsbl.sorbs.net:ip4set:dul.dnsbl.sorbs.net
  http.dnsbl.sorbs.net:dnset:http.dnsbl.sorbs.net
  smtp.dnsbl.sorbs.net:ip4set:smtp.dnsbl.sorbs.net
  new.spam.dnsbl.sorbs.net:ip4set:new.spam.dnsbl.sorbs.net
  dnsbl-1.uceprotect.net:ip4set:dnsbl-1.uceprotect.net

which is also dropped (for pdns-recursor) in forward-zones, like so

  dul.dnsbl.sorbs.net=127.0.0.1:530
  http.dnsbl.sorbs.net=127.0.0.1:530
  smtp.dnsbl.sorbs.net=127.0.0.1:530
  new.spam.dnsbl.sorbs.net=127.0.0.1:530
  dnsbl-1.uceprotect.net=127.0.0.1:530

Apparently, an ip4set is a set of IPv4 addresses, while a dnset is a set of
domain names.

I still don’t know how to translate

  - Rsync the feed files into /var/lib/rbldnsd

And I don’t know whether I am supposed to rely only on sorbs + uceprotect, or
whether I am supposed to somehow cobble similar sets together for Mailspike,
SpamCop, Spamhaus ZEN, SURBL and URIBL (which circles me back to the original
mail header notation which brought me here.) See

  https://cwiki.apache.org/confluence/display/spamassassin/DnsBlocklists#dnsbl-block

I am impressed by the level of obscurity, not to mention the sprawling
vastness of spamassassin. Further assistance is needed.

--
p...@ehealth.id.au
“…an hour is coming when all who are in the tombs will hear his voice and come out…”

> On 15 Mar 2021, at 1:29 am, John Hardin wrote:
>
> On Sun, 14 Mar 2021, jwmi...@gmail.com wrote:
>
>> Peter West writes:
>>
>> And you might want to fix the URIBL_BLOCKED issue. Fixing the
>> URIBL_BLOCKED issue will do far more to fix your issues than adding
>> rules.
>
> Seconded. The keywords here are "local, caching, *NON-FORWARDING* DNS server
> for SpamAssassin".
>
> If that isn't enough to set you on the right path, search the mailing list
> archives for "URIBL-BLOCKED" or "URIBL DNS" for previous discussions of this
> topic. If that history isn't enough, feel free to ask for assistance.
>
> --
> John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
> jhar...@impsec.org    pgpk -a jhar...@impsec.org
> key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
> ---
> Failure to plan ahead on someone else's part does not constitute
> an emergency on my part. -- David W. Barts in a.s.r
> ---
> Today: Daylight Saving Time begins in U.S. - Spring Forward
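
The two lists in the message above follow a visible pattern: each rbldnsd entry has the shape zone:type:datafile, and the matching pdns-recursor forward-zones entry is zone=address:port. A minimal sketch of that mapping follows, assuming the 127.0.0.1:530 listener shown in the message. The function name forward_zones is hypothetical, and the sketch only rewrites text; it does not fetch or rsync any feed files.

```python
def forward_zones(zone_specs, resolver="127.0.0.1:530"):
    """Map rbldnsd 'zone:type:datafile' specs to forward-zones lines.

    `resolver` is the address:port where rbldnsd is assumed to listen
    (taken from the example in the message; adjust for a real setup).
    """
    lines = []
    for spec in zone_specs:
        spec = spec.strip()
        if not spec or spec.startswith("#"):
            continue  # skip blank lines and comments
        # Only the zone name is needed; dataset type and file are ignored.
        zone, _dataset_type, _datafile = spec.split(":", 2)
        lines.append(f"{zone}={resolver}")
    return lines


if __name__ == "__main__":
    specs = [
        "dul.dnsbl.sorbs.net:ip4set:dul.dnsbl.sorbs.net",
        "http.dnsbl.sorbs.net:dnset:http.dnsbl.sorbs.net",
        "dnsbl-1.uceprotect.net:ip4set:dnsbl-1.uceprotect.net",
    ]
    for line in forward_zones(specs):
        print(line)
```

Generating one list from the other keeps the two files in step, so a zone added to rbldnsd is not forgotten in the recursor configuration.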
Re: Problem with local.cf rules
Well, that was simple. Thank you.

What’s the default value of a rule? Does it have one?

--
p...@ehealth.id.au
“Two men went up into the temple to pray, one a Pharisee and the other a tax collector.”

> On 14 Mar 2021, at 11:41 pm, Alex Woick wrote:
>
> Peter West wrote on 14.03.2021 at 14:30:
>> header CASINO From =~ /\bcasino\b/i
>> score 100.0
>>
>> ===
>>
>> It’s hitting the CASINO rule, but no matter what value I assign to the
>> casino rules - 5, 20, 100 - these messages always come through with a value
>> of 4.1. It’s as though some other rule is resetting the score to 0 before
>> proceeding.
> You need to tell the rule name with the score keyword, otherwise spamassassin
> cannot know to which rule it should set the score.
>
> score CASINO 100
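
The mistake diagnosed above (a score line with a number but no rule name) is easy to repeat across a long local.cf. A small check along these lines could flag such lines before spamd is restarted. This is a sketch only: the function name find_unnamed_scores is hypothetical, and it assumes score lines of the form "score RULE_NAME value", as in the reply.

```python
import re

# Matches a plain number such as 100, 6.0 or -20.0.
NUMBER = re.compile(r"[-+]?\d+(\.\d+)?")


def find_unnamed_scores(config_text):
    """Return (line_number, line) pairs for `score` lines lacking a rule name."""
    problems = []
    for number, raw in enumerate(config_text.splitlines(), start=1):
        tokens = raw.split()
        if not tokens or tokens[0] != "score":
            continue  # not a score line
        # Expected shape: score RULE_NAME 100
        # Flag lines with too few fields, or a number where the name belongs.
        if len(tokens) < 3 or NUMBER.fullmatch(tokens[1]):
            problems.append((number, raw.strip()))
    return problems


if __name__ == "__main__":
    sample = "\n".join([
        r"header CASINO From =~ /\bcasino\b/i",
        "score 100.0",
        "score CASINO 100",
    ])
    for number, line in find_unnamed_scores(sample):
        print(f"line {number}: missing rule name: {line}")
```

SpamAssassin ships its own configuration check (spamassassin --lint), which is the authoritative test; the sketch only illustrates the specific pattern discussed in this thread.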
Problem with local.cf rules
I’m running spamassassin 3.4.2-0 on Ubuntu 18.04.4. The controlling process is

  /usr/bin/perl -T -w /usr/sbin/spamd -d --pidfile=/var/run/spamd.pid --create-prefs --max-children 5 --helper-home-dir

My local.cf has local rules enabled, and contains, inter alia, these rules

=
header CASINO From =~ /\bcasino\b/i
score 100.0
header CASINOS From =~ /\bcasinos\b/i
score 100.0
header CASINO_DONOVAN From =~ /\bray donovan\b/i
score 100.0
header CASINO_OLIVIA From =~ /\bolivia.*cs\b/i
score 100.0
header BAD_WORDS_1 From =~ /\b(swimming|solarbank|bag|intelligent|napkin|stretcher)\b/i
score 6.0
header BAD_WORDS_2 From =~ /\b(smart|amazing|clavicle|slicer|indestructible|bamboo)\b/i
score 6.0
header BAD_WORDS_3 From =~ /\b(innovation|selfie|socks|healthreporters|thermovest)\b/i
score 6.0
header BAD_WORDS_4 From =~ /\bdrone\b|\bremover\b|\btrainer\b|\btactical\b|\bsmart watch\b/i
score 6.0
header BAD_WORDS_5 From =~ /(\blost\b.*[0-9]+.*lbs\b)/i
score 10.0
header BAD_WORDS_6 From =~ /\bdrone\b|\bprofessional\b|\bslim\b|\bmini\b/i
header AUSPOST_GOOD From =~ /auspost\.com\.au/
score -20.0
header AUSPOST_BAD From =~ /Australia Post/
score 20.0
===

The casino stuff is still getting through. Here’s the X-Spam-Status on a
typical message.

X-Spam-Status: No, score=4.1 required=5.0 tests=CASINO,DKIM_SIGNED,DKIM_VALID,
  DKIM_VALID_AU,HTML_MESSAGE,MAILING_LIST_MULTI,RAZOR2_CF_RANGE_51_100,
  RAZOR2_CHECK,SPF_HELO_PASS,SPF_PASS,URIBL_BLOCKED shortcircuit=no
  autolearn=no autolearn_force=no version=3.4.2

It’s hitting the CASINO rule, but no matter what value I assign to the casino
rules - 5, 20, 100 - these messages always come through with a value of 4.1.
It’s as though some other rule is resetting the score to 0 before proceeding.

My other query concerns the AUSPOST rules. What I want to do is eliminate mail
that has a name of “AUSTRALIA POST” and does NOT have an address containing .
Hence I’m trying the -20.0 +20.0 pair of rules. Is there a more direct way of
achieving this? Will a PCRE ’not followed by’ style of rule do the trick? Is
there a finer subdivision of the From header, into name and address, for
example?

peter

--
p...@ehealth.id.au
“Two men went up into the temple to pray, one a Pharisee and the other a tax collector.”
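
The name/address split asked about above can be illustrated outside SpamAssassin. The sketch below separates a From header into display name and address, then flags mail whose name claims Australia Post while the address domain is not auspost.com.au. It is an illustration of the logic, not a SpamAssassin rule. The function name looks_like_spoofed_auspost and the example.com address are hypothetical. The domain auspost.com.au is taken from the AUSPOST_GOOD rule in the message. Note one deliberate difference from that rule: the sketch tests the address domain rather than matching the string anywhere in the header. SpamAssassin's configuration documentation describes :name and :addr modifiers for header rules, which appear to offer the same split natively; check the documentation for the installed version before relying on them.

```python
from email.utils import parseaddr


def looks_like_spoofed_auspost(from_header):
    """True if the display name claims Australia Post but the domain does not.

    Hypothetical helper for illustration; the domain check is an assumption
    based on the AUSPOST_GOOD rule in the message.
    """
    name, address = parseaddr(from_header)
    claims_auspost = "australia post" in name.lower()
    domain = address.rpartition("@")[2].lower()
    genuine_domain = (
        domain == "auspost.com.au" or domain.endswith(".auspost.com.au")
    )
    return claims_auspost and not genuine_domain


if __name__ == "__main__":
    for header in (
        "Australia Post <noreply@auspost.com.au>",
        '"AUSTRALIA POST" <promo@example.com>',
        "Someone Else <someone@example.com>",
    ):
        print(header, "->", looks_like_spoofed_auspost(header))
```

Combining the two conditions in one test avoids the balancing act of a -20.0/+20.0 pair, where a message that matches both rules lands on a net score of zero rather than being explicitly cleared.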