Re: [Pan-devel] You need to incorporate Pre-processing tool (filters) to control downloading

Duncan Sun, 07 Sep 2014 02:21:13 -0700

Rajib Bandopadhyay posted on Sat, 06 Sep 2014 20:50:10 +0530 as excerpted:

> I have used PAN in the past and was impressed by its ease of use.
> 
> However, there is a great disadvantage for PAN - it doesn't have a
> pre-processing filter like Claws Mail.


[You originally posted to the pan-devel list.  However, pan-user has more 
traffic and the following subthread is likely both more discoverable by 
new users and available to existing regulars if it's posted to pan-user, 
so I'm posting this reply to both, with followup to pan-user (only) 
requested.]

Being a user of both pan and claws-mail, I can say that pan has 
functionality conceptually as powerful as that of at least claws-mail's 
built-in filtering functionality.  However, it (1) works differently and 
perhaps not as immediately intuitively, (2) does not have claws-mail's 
script-your-own extensibility (tho in the higher performance multi-
threaded pan environment the only real effective way to do that would be 
native code or at least JIT-compiled bytecode in any case, and of course 
those with the ability to code can already do that by patching pan's own 
code directly), and (3) there's potential to make it even more powerful, 
but one bug and lack of a couple feature extensions currently limit it a 
bit.

I am of course speaking of pan's scoring functionality combined with the 
(relatively) new feature, (automated) score-based "actions".  More below.

> This pre-processing filter empowers us readers to pre-filter files we
> need to download, and we can control, selectively download, or even bar
> downloads, with pre-processing filters.
> 
> This pre-processing filter makes it a great tool. But it has its
> weaknesses, its downloads are single-threaded and hence, very slow.
> 
> You have a scope here to improve upon your design!

As I said above, pan's comparable feature combination is scoring combined 
with actions.  But there's a big conceptual difference in how they're 
implemented in pan, for at least two reasons:

1) The nature of the news protocol and typical user task requirements.

2) The requirements of pan's high-performance multi-threaded environment.

Unlike claws-mail which is designed with single-threaded user-scripted-
extensibility as a very high priority, pan's automation focus is on 
higher-volume multi-threaded binary-post downloading and attachment 
saving.  Certainly both clients are designed to be usable for BOTH 
binaries and text.  But...

A primary focus of claws-mail is extreme user-scripted extensibility in a 
lower performance single-threaded environment with an assumption that 
actually reading message text is the primary reason for downloading, such 
that lower-performance single-threaded execution, with single-threaded 
"we can stop and wait for the extension-script result before proceeding", 
works just fine.  Efficient download and saving of binary attachments is 
certainly possible, but it's clearly lower on the priority scale than 
simple reading of primarily text messages, and full user-scripted 
extensibility, even at the cost of efficiency and effectively locking 
processing to single-thread.

OTOH...

A primary focus of pan is on high performance multi-threaded download and 
saving of binary attachments, with clear emphasis on "high performance" 
and "multi-threaded", and with no assumption that the text message itself 
is of any interest at all as it may in fact be ONLY the saved binary 
attachments that are of interest.  Clearly pan can reasonably effectively 
handle reading and replying to text messages as well and that in fact is 
the primary use-case of many users including myself, but pan's emphasis 
on efficient multi-threaded downloading means the claws-mail "we can and 
will stop everything and wait for the result of a user-script extension 
before proceeding" approach is simply not possible and ENTIRELY out of 
the question.

That's also the reason for pan's distinction between "caching" and 
"downloading", and why pan's local message cache is relatively small (10 
MB IIRC) by default.

* "Downloading" in pan terms means caching whole multi-individual-text-
post messages (where 100 or more such individual text messages are often 
entirely transparently combined to allow the next step), automatically 
extracting and saving off the attachments, then deleting the still 
entirely unread text messages from cache in order to make room for the 
next batch of individual text messages to be cached and their attachments 
saved off.  By this definition ONLY the often rather large binary 
attachments matter; the text messages themselves are simply the container 
the attachments ship in and can be discarded after the attached contents 
are safely unpacked and stored.

* By contrast, "caching" in pan terms is what claws-mail would call 
"downloading", that is, downloading the text message to cache, leaving it 
marked unread, ready for the user to read and save or reply to later as 
desired.  This "text mode message processing", as opposed to "binary mode 
attachment processing", is conceptually an /entirely/ different work-flow 
where high efficiency multi-thread bulk-processing isn't as vital, but 
user-scripted extensibility such as claws-mail offers for this use-case 
might be.  Pan's default 10 MB cache is rather small for this, and 
indeed, my text-instance[1] pan is configured as unexpiring with multi-
gig cache so it doesn't delete anything, and I have the entire content of 
the couple dozen or so text groups I follow (including the pan lists, as 
newsgroups via gmane.org's list2news service), minus a few spam and troll 
posts, going back over ten years in several cases.  =:^)  I even have the 
contents of several of my ISP's old discussion newsgroups still cached, 
even tho they killed their news server some years ago now so there's 
actually no server left to connect to for new messages (tho a few of the 
former ISP-private groups live on as pretty much entirely spam-filled 
zombie groups on various commercial NSP services).

It's important to get this difference, because back in the day when I had 
first come to pan on Linux from MSOE on MS Windows, I was originally 
rather dismayed to see messages I had "downloaded" but only to cache, 
believing them to be saved for reading and further processing later, 
suddenly disappearing again, still unread, as I downloaded (to cache) 
additional messages!  The cache was still set to its default 10 MB, and I 
had 10 MB of messages downloaded, so pan was simply deleting the oldest 
ones from cache in order to make way for new ones.  Once I figured out 
what was happening, I was able to set a far bigger cache (tho it was 
limited to I think 1 GB or some such back then, no such limits now), and 
my messages quit disappearing from cache before I even had a chance to 
read them!
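
To make the disappearing-messages behavior concrete, here is a rough 
Python sketch of a size-capped cache that drops its oldest entries first, 
whether or not they were ever read.  This is only an illustration of the 
behavior described above, not pan's actual code; the class and method 
names are made up.

```python
from collections import OrderedDict


class SizeCappedCache:
    """Toy model: oldest cached messages are dropped once the cap is hit."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.items = OrderedDict()  # message-id -> size, oldest first

    def add(self, message_id, size):
        # Re-adding a message replaces its old entry.
        if message_id in self.items:
            self.used -= self.items.pop(message_id)
        self.items[message_id] = size
        self.used += size
        # Evict from the oldest end until we fit again; the newest entry
        # is always kept, and read/unread status plays no part here.
        while self.used > self.max_bytes and len(self.items) > 1:
            _, old_size = self.items.popitem(last=False)
            self.used -= old_size
```

With a cap of 10 and three messages of size 4, adding the third pushes 
the first one out, which is exactly what a small default cache does to 
unread text posts.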

The point being, pan ASSUMES binary mode download, attachment saving and 
unread containing-text-message discard by default, and is optimized to 
process that as efficiently as it can.  While it can /do/ text messages 
and in fact isn't actually a bad text-message news client at all, so much 
so that many folks including me actually use pan primarily for such text-
mode messages and groups, that's not what it assumes or is optimized for, 
just as claws-mail doesn't assume nor is it optimized for the bulk-binary-
mode attachment saving pan handles far more efficiently, even tho it can 
handle it in its slower single-threaded fashion.  (Tho I don't actually 
know if claws-mail handles the more efficient yenc-encoded attachments 
common on binary newsgroups or not, but if it doesn't do so directly, 
there's certainly third-party-utilities that can do so, and claws-mail is 
certainly extensible enough to add that functionality as a third-party-
script-extension if desired.)


OK, now that we've dealt with the underlying concepts, let's move on to 
their practical application in the context of your post.  Again, pan's 
implementation is scoring combined with actions.  We'll deal with them 
one at a time.

1) Scoring

Pan's scoring system can work in one of two modes, absolute or 
incremental.  In fact, pan's ignore and watch features are implemented 
simply as the extreme ends of absolute-mode scoring: scores of -9999 (or 
lower) are treated as ignored, scores of (+)9999 (or higher) as watched, 
and the ignore and watch commands simply create scoring rules that set 
=-9999 and =9999 absolute scores.

But in the absence of an applicable absolute (aka forced) score, pan will 
match any incremental mode scores that apply and the resulting score is 
the total of all incremental scoring rules.
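
As a rough illustration of how the two modes combine, here is a Python 
sketch.  It is a model of the description above, not pan's code; the 
Match type and combine() are hypothetical names, and picking the first 
absolute match when several apply is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """One scoring rule that matched an article (hypothetical type)."""
    value: int
    absolute: bool = False  # True models an "=N" (forced) score


def combine(matches):
    """Return the final score for one article from its matching rules."""
    for m in matches:
        if m.absolute:
            # An absolute score wins outright.  Assumption: if several
            # absolute rules match, the first one found is used.
            return m.value
    # No absolute score applies: incremental scores simply add up.
    return sum(m.value for m in matches)
```

So +100 and -30 incremental matches give 70, while adding a matching 
=-9999 rule forces the result to -9999 no matter what else matched.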

Further, there are several scoring "zones" in addition to the two 
extremes.  If you have the score column active in your headers pane/tab, 
pan will color-code the scored posts by zone as configured on the colors 
tab of prefs, and can be set to show or hide posts by score zone as well, 
as accessed via the "match scores" section in the view, header pane 
submenu.  As can be seen from both places, the scoring zones are as 
follows, lowest to highest:

-9999   (and lower):    ignored

-9998   to      -1:     low (negative)

0:                      normal (zero/neutral, no scores apply or the 
effect of multiple scores combined is as if none applied at all)

1       to      4999:   medium

5000    to      9998:   high

9999    (and higher):   watched
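
The same zone table, written out as a small Python function for anyone 
who finds boundaries easier to read as code.  The function name is made 
up; the boundaries are just the ones listed above.

```python
def zone(score: int) -> str:
    """Map a numeric score to the zone name from the table above."""
    if score <= -9999:
        return "ignored"
    if score < 0:
        return "low"
    if score == 0:
        return "normal"
    if score < 5000:
        return "medium"
    if score < 9999:
        return "high"
    return "watched"
```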

These scoring zones are critical to the automated actions discussed 
below, but before we get to them, let's talk a bit more about how the 
scoring rules actually work.

In general, pan's scoring rules are stored in a scorefile with a format 
based on that of another news client, slrn, altho pan's implementation 
isn't as advanced as that of slrn.  FWIW, at least one other news client, 
xnews (MS platform news client), uses a very similar scorefile format.

Here's the slrn scorefile.txt documentation:

http://slrn.sourceforge.net/docs/score.txt

Again, keep in mind while reading that, that pan's implementation is very 
similar, but not as advanced.  In particular, pan lacks support for the 
include directive, as well as for nested/grouped rules.  An additional 
difference is that pan's processing is case insensitive.

And one additional difference: presently, pan's scoring only supports 
logical OR: if ANY of the conditions match, the scoring rule is applied.  
As documented, logical OR is Score:: (double colon), while Score: (single 
colon) /should/ be logical AND (only apply if ALL conditions match).  
Presently, however, Score: seems to behave like Score::; both are 
treated as logical OR and ANY matching condition triggers the score.

I'm not sure but I /believe/ this to be a bug as I could almost swear 
that logical AND (single colon) *USED* to work.  But someone posted that 
they couldn't get it working and I tested and sure enough, all my scores 
were being applied as logical OR as well, so either it broke somewhere 
along the line or I'm mis-remembering and it never worked in the first 
place.
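
For clarity, here is the documented difference between the two forms as 
a Python sketch.  This models the slrn-style semantics only, with made-up 
names; it is not pan's matcher.  The case-insensitive matching follows 
the note above about pan's processing.

```python
import re


def rule_matches(conditions, headers, match_any):
    """conditions: list of (header_name, regex) pairs for one rule.

    match_any=True  models Score:: (OR,  any condition may match).
    match_any=False models Score:  (AND, all conditions must match).
    """
    hits = (
        re.search(pattern, headers.get(name, ""), re.IGNORECASE) is not None
        for name, pattern in conditions
    )
    return any(hits) if match_any else all(hits)
```

Going by the behavior described above, pan currently acts as if 
match_any were True for both forms.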

Tho it can be noted that with pan's scoring zones and appropriate use of 
incremental scoring, /almost/ the same effect can be achieved by simply 
adjusting the score values of the various individual elements composing 
the would-be AND, such that selected posts only fall in the desired score-
zone if ALL the appropriate conditions match, otherwise they'll fall into 
a different zone due to failure to match some of the conditions that 
incrementally combine to put it in the target zone.


Meanwhile, it's also worthwhile to specifically point out that as pan 
creates the scores, MOST of the lines pan writes into the scorefile are 
either blank lines or comments, marked by the leading % comment 
indicator.  You can thus trim all the %BOS and %EOS lines, etc, without 
affecting actual scoring functionality at all, as they're comments, there 
simply to clarify what pan was actually doing when it wrote the score, 
add date information, etc.

Also, note the overview-headers-only recommendation, which applies to pan 
as well as slrn.  In particular, while scoring can be done on any header, 
if the header isn't in the overview file, it's likely the whole post will 
have to be downloaded before the score can be applied.  While for ignored 
posts especially that can still be better than having to actually see and 
deal with the post manually, it does mean it has to be downloaded to 
cache before the score can be applied, so if at all possible, it's MUCH 
better to score on overview-included headers only, thus avoiding the 
download entirely.
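
As a quick way to check a header before writing a rule on it, here is a 
Python sketch.  The set below is the standard NNTP overview field list 
from RFC 3977; individual servers may carry extra fields (Xref is a 
common one), so treat it as an assumption about a typical server.  The 
function name is made up.

```python
# Standard NNTP overview fields (RFC 3977).  Servers may add more, so
# this set is an assumption about a typical server, not a guarantee.
OVERVIEW_HEADERS = {
    "subject", "from", "date", "message-id", "references", "bytes", "lines",
}


def needs_full_download(header_name: str) -> bool:
    """True if scoring on this header likely requires fetching the post."""
    return header_name.strip().rstrip(":").lower() not in OVERVIEW_HEADERS
```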


Finally, here's a link (via gmane) to an earlier post of mine, with an 
excerpt from my own scorefile (and some additional explanation/commentary) 
as an example of what a nicely organized hand edited scorefile can look 
like in practice.

http://permalink.gmane.org/gmane.comp.gnome.apps.pan.user/8689


OK, that covers scoring, but other than showing or hiding posts and/or 
making their scores show up in pretty colors in the header pane's score 
column, of what practical USE are they?  In particular, how can they be 
used to trigger automated pre-download filtering and selective download 
or delete-before-download?  That's where actions come in! =:^)

2) Actions

Once you understand how scoring works and master the art of writing good 
scoring rules, pan's still relatively new (automated) actions feature 
makes putting those scoring rules to practical use actually quite simple. 
=:^)

Actions are configured in pan preferences on the actions tab, and are 
scoring-zone based.  Depending on how much you want to rely on scoring to 
determine what's automatically processed, there are several suggested 
configurations possible.  I'd recommend NOT setting automated delete or 
even mark-read just yet, until you've watched how your scoring config is 
working and are comfortable that it's working as intended and you're not 
going to be missing a whole lot of posts due to accidental ignore-score 
matches.

In fact, here's what I recommend:

Before setting up actions at all, do this:

1) In pan prefs, headers tab:

Ensure that you have the score column enabled, and order it so it's in 
view all the time.

Back in the main window, header pane/tab, expand or shrink the score and 
other columns as necessary to fit, while keeping the score column in view.

2) In the view menu, header pane submenu:

Ensure that ALL the "match scores of" options, including "low" and 
"ignored", are checked.

3) Back in pan prefs, on the colors tab under header pane:

Set up your colors for each score zone so you can tell the zones apart 
just by color.

In particular, make sure the ignored and low/negative score zone colors 
stand out, as well as watched, and, if you eventually intend to 
auto-download them, the medium and high score zones too.

4) In pan prefs, actions tab:

Ensure that all actions are currently DISABLED, for testing mode.


Now go back to using pan, setting up scores to put posts in the desired 
score zones as appropriate, and watching that the scores work as 
intended.  In particular, be sure the low/negative and ignored zones 
aren't catching posts that you actually want to see.


5) After some time watching that posts are getting assigned to their 
intended score zones, when you're comfortable that they are...

6) Back in pan prefs, on the actions tab:

Enable actions for score zones as appropriate.  Here's a recommended 
example:

Delete articles scoring at:     -9999 or less (ignored)

This is optional.  If you're conservative, you might wish to keep this 
disabled instead.

Mark articles read scoring at:  -9998 to -1 (low)

Alternatively, if you're not deleting ignored articles, you can set it to 
simply mark-read ignored.

Assuming you have pan set to hide read posts, this will hide them, but 
the headers won't actually be deleted, so you can still set show read 
posts temporarily if you want to refer back to them, perhaps because 
someone (not scored so low so you see the post) referred to them in a 
quote and you want to read the entire post to get the context.


Note that the above "negative actions" should work at the set level AND 
BELOW.  So if you for instance set mark-read from low-zone, it should 
mark-read ignored-zone as well.

The below "positive actions" should work the other way, at the set level 
AND ABOVE.  So if you set cache articles scoring medium-zone, it should 
also cache those in the high and watched zones.
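
The "and below" / "and above" behavior, sketched in Python.  This models 
the description above rather than pan's code, and the function names are 
made up.

```python
# Zones in order, lowest to highest.
ZONES = ["ignored", "low", "normal", "medium", "high", "watched"]


def negative_action_applies(article_zone: str, configured_zone: str) -> bool:
    """Delete / mark-read: fires at the configured zone AND BELOW."""
    return ZONES.index(article_zone) <= ZONES.index(configured_zone)


def positive_action_applies(article_zone: str, configured_zone: str) -> bool:
    """Cache / download: fires at the configured zone AND ABOVE."""
    return ZONES.index(article_zone) >= ZONES.index(configured_zone)
```

So mark-read set at low also catches ignored posts, and cache set at 
medium also catches high and watched posts, while normal (zero-score) 
posts are left alone by both.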


Depending on whether you run pan in binary download-and-save-attachments 
or cache-and-process-later mode, and whether you want to auto-cache/
download medium and high scorezone posts or only watched posts, you can 
set these as appropriate.  For example:

Read-text-mode example:

Cache articles scoring at:      5000 to 9998 (high)

Download-binary-mode example:

Download attachments of articles scoring at:    9999 or more (watched)

For the download example, after you're sure it's downloading as 
appropriate, you likely want to set the mark affected articles read 
option as well, since once the attachments are downloaded (and saved) you 
probably don't care to see them any longer.


I believe once you have both scoring and actions set up appropriately, 
you'll find it does what you need quite well.  Pre-processing filters?  
Why?  That would only slow pan down!  =:^)

Tho since you posted to the dev list there's a reasonable chance that you 
can code in the C++ pan's written in, and if so, I'm sure no one would 
object if you found time to figure out why pan's AND scoring doesn't work 
and fix that, and if we're /really/ lucky, we might even get a patch for 
the missing include and/or nested/grouped scoring condition support, as 
well.  Then pan's scoring support would /really/ rock! =:^)

Several devs have in the past cloned pan's git repo to their github 
accounts and requested pulls from there when they have patches ready. =:^)

---
[1] My text-instance:  I run several separate pan instances each with its 
own config and cache, one for text, one for binaries, and one for 
temporary testing.  Pan reads the PAN_HOME environment variable and uses 
that for its config and data if set, otherwise using the default (~/.pan2 
on *ix anyway, I don't do windows).  I use that in a wrapper script to 
point pan at the appropriate config depending on whether I launched the 
bin, text or test wrapper.
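
A minimal sketch of such a wrapper, in Python.  PAN_HOME is the variable 
named above; the per-instance directory names are hypothetical 
placeholders (pick whatever layout suits you), and it assumes a "pan" 
executable is on the PATH.

```python
import os
import sys

# Hypothetical per-instance directories; these names are placeholders.
INSTANCE_DIRS = {
    "text": "~/.pan2-text",
    "bin": "~/.pan2-bin",
    "test": "~/.pan2-test",
}


def launch_env(instance, base_env=None):
    """Return a copy of the environment with PAN_HOME set for instance."""
    env = dict(os.environ if base_env is None else base_env)
    env["PAN_HOME"] = os.path.expanduser(INSTANCE_DIRS[instance])
    return env


if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] in INSTANCE_DIRS:
    # Replace this process with pan, passing along any remaining arguments.
    os.execvpe("pan", ["pan"] + sys.argv[2:], launch_env(sys.argv[1]))
```

Saved under some name of your choosing and run with text, bin or test as 
the first argument, it starts pan against that instance's config and 
cache.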

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


_______________________________________________
Pan-devel mailing list
Pan-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/pan-devel
