Rajib Bandopadhyay posted on Sat, 06 Sep 2014 20:50:10 +0530 as excerpted:

> I have used PAN in the past and was impressed by its ease of use.
>
> However, there is a great disadvantage for PAN - it doesn't have a
> pre-processing filter like Claws Mail.

[You originally posted to the pan-devel list. However, pan-user has more traffic and the following subthread is likely both more discoverable by new users and available to existing regulars if it's posted to pan-user, so I'm posting this reply to both, with followup to pan-user (only) requested.]

Being a user of both pan and claws-mail, I can say that pan has functionality conceptually as powerful as at least claws-mail's built-in filtering functionality. However, it (1) works differently and perhaps not as immediately intuitively, (2) does not have claws-mail's script-your-own extensibility (tho in the higher performance multi-threaded pan environment the only real effective way to do that would be native code or at least JIT-compiled bytecode in any case, and of course those with the ability to code can already do that by patching pan's own code directly), and (3) has the potential to be even more powerful, but one bug and the lack of a couple feature extensions currently limit it a bit.

I am of course speaking of pan's scoring functionality combined with the (relatively) new feature, (automated) score-based "actions". More below.

> This pre-processing filter empowers us readers to pre-filter files we
> need to download, and we can control, selectively download, or even bar
> downloads, with pre-processing filters.
>
> This pre-processing filter makes it a great tool. But it has its
> weaknesses, its downloads are single-threaded and hence, very slow.
>
> You have a scope here to improve upon your design!

As I said above, pan's comparable feature combination is scoring combined with actions. But there's a big conceptual difference in how they're implemented in pan, for at least two reasons:

1) The nature of the news protocol and typical user task requirements.

2) The requirements of pan's high-performance multi-threaded environment.

Unlike claws-mail, which is designed with single-threaded user-scripted extensibility as a very high priority, pan's automation focus is on higher-volume multi-threaded binary-post downloading and attachment saving.

Certainly both clients are designed to be usable for BOTH binaries and text. But...

A primary focus of claws-mail is extreme user-scripted extensibility in a lower performance single-threaded environment, with an assumption that actually reading message text is the primary reason for downloading, such that lower-performance single-threaded execution, with single-threaded "we can stop and wait for the extension-script result before proceeding", works just fine. Efficient download and saving of binary attachments is certainly possible, but it's clearly lower on the priority scale than simple reading of primarily text messages and full user-scripted extensibility, even at the cost of efficiency and effectively locking processing to a single thread.

OTOH...

A primary focus of pan is on high performance multi-threaded download and saving of binary attachments, with clear emphasis on "high performance" and "multi-threaded", and with no assumption that the text message itself is of any interest at all, as it may in fact be ONLY the saved binary attachments that are of interest. Clearly pan can reasonably effectively handle reading and replying to text messages as well, and that in fact is the primary use-case of many users including myself, but pan's emphasis on efficient multi-threaded downloading means the claws-mail "we can and will stop everything and wait for the result of a user-script extension before proceeding" approach is simply not possible and ENTIRELY out of the question.

That's also the reason for pan's distinction between "caching" and "downloading", and why pan's local message cache is relatively small (10 MB IIRC) by default.

* "Downloading" in pan terms means caching whole multi-individual-text-post messages (where 100 or more such individual text messages are often entirely transparently combined to allow the next step), automatically extracting and saving off the attachments, then deleting the still entirely unread text messages from cache in order to make room for the next batch of individual text messages to be cached and their attachments saved off. By this definition ONLY the often rather large binary attachments matter; the text messages themselves are simply the container the attachments ship in and can be discarded after the attached contents are safely unpacked and stored.

* By contrast, "caching" in pan terms is what claws-mail would call "downloading", that is, downloading the text message to cache, leaving it marked unread, ready for the user to read and save or reply to later as desired. This "text mode message processing", as opposed to "binary mode attachment processing", is conceptually an /entirely/ different work-flow where high efficiency multi-thread bulk-processing isn't as vital, but user-scripted extensibility such as claws-mail offers for this use-case might be.

Pan's default 10 MB cache is rather small for this, and indeed, my text-instance[1] pan is configured as unexpiring with multi-gig cache so it doesn't delete anything, and I have the entire content of the couple dozen or so text groups I follow (including the pan lists, as newsgroups via gmane.org's list2news service), minus a few spam and troll posts, going back over ten years in several cases. =:^) I even have the contents of several of my ISP's old discussion newsgroups still cached, even tho they killed their news server some years ago now so there's actually no server left to connect to for new messages (tho a few of the former ISP-private groups live on as pretty much entirely spam-filled zombie groups on various commercial NSP services).

It's important to get this difference, because back in the day when I had first come to pan on Linux from MSOE on MS Windows, I was originally rather dismayed to see messages I had "downloaded" but only to cache, believing them to be saved for reading and further processing later, suddenly disappearing again, still unread, as I downloaded (to cache) additional messages! The cache was still set to its default 10 MB, and I had 10 MB of messages downloaded, so pan was simply deleting the oldest ones from cache in order to make way for new ones. Once I figured out what was happening, I was able to set a far bigger cache (tho it was limited to I think 1 GB or some such back then, no such limits now), and my messages quit disappearing from cache before I even had a chance to read them!

The point being, pan ASSUMES binary mode download, attachment saving and unread containing-text-message discard by default, and is optimized to process that as efficiently as it can. While it can /do/ text messages and in fact isn't actually a bad text-message news client at all, so much so that many folks including me actually use pan primarily for such text-mode messages and groups, that's not what it assumes or is optimized for, just as claws-mail doesn't assume nor is it optimized for the bulk-binary-mode attachment saving pan handles far more efficiently, even tho it can handle it in its slower single-threaded fashion. (Tho I don't actually know if claws-mail handles the more efficient yenc-encoded attachments common on binary newsgroups or not, but if it doesn't do so directly, there are certainly third-party utilities that can do so, and claws-mail is certainly extensible enough to add that functionality as a third-party script extension if desired.)

OK, now that we've dealt with the underlying concepts, let's move on to their practical application in the context of your post.

Again, pan's implementation is scoring combined with actions.
We'll deal with them one at a time.

1) Scoring

Pan's scoring system can work in one of two modes, absolute or incremental. In fact, pan's ignore and watch features are implemented simply enough as the extreme ends of absolute-mode scoring: -9999 (or lower) scores are ignored, (+)9999 (or higher) scores are interpreted as watched, and pan's ignore and watch features simply create scoring rules that set =-9999 and =9999 absolute scores.

But in the absence of an applicable absolute (aka forced) score, pan will match any incremental mode scores that apply, and the resulting score is the total of all matching incremental scoring rules.

Further, there are several scoring "zones" in addition to the two extremes. If you have the score column active in your headers pane/tab, pan will color-code the scored posts by zone as configured on the colors tab of prefs, and can be set to show or hide posts by score zone as well, as accessed via the "match scores" section in the view, header pane submenu. As can be seen from both places, the scoring zones are as follows, lowest to highest:

-9999 (and lower): ignored
-9998 to -1:       low (negative)
0:                 normal (zero/neutral, no scores apply or the effect of multiple scores combined is as if none applied at all)
1 to 4999:         medium
5000 to 9998:      high
9999 (and higher): watched

These scoring zones are critical to the automated actions discussed below, but before we get to them, let's talk a bit more about how the scoring rules actually work.

In general, pan's scoring rules are stored in a scorefile with a format based on that of another news client, slrn, altho pan's implementation isn't as advanced as that of slrn. FWIW, at least one other news client, xnews (MS platform news client), uses a very similar scorefile format.

Here's the slrn scorefile.txt documentation:

http://slrn.sourceforge.net/docs/score.txt

Again, keep in mind while reading it that pan's implementation is very similar, but not as advanced.

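To make the zone boundaries and the absolute-vs-incremental behavior concrete, here's a minimal sketch in Python. It's purely illustrative: pan itself is C++, and the function names here (zone, final_score) are made up for this example, not taken from pan's code. It also assumes at most one absolute score applies to a post, since how pan picks among several matching absolute rules isn't covered here.

```python
from typing import Iterable, Optional


def zone(score: int) -> str:
    """Map a final score to the zone names in the list above."""
    if score <= -9999:
        return "ignored"
    if score < 0:
        return "low"
    if score == 0:
        return "normal"
    if score < 5000:
        return "medium"
    if score < 9999:
        return "high"
    return "watched"


def final_score(incremental: Iterable[int],
                absolute: Optional[int] = None) -> int:
    """An absolute (forced) score wins; otherwise sum the incremental ones.

    Assumption for this sketch: at most one absolute score applies.
    """
    if absolute is not None:
        return absolute
    return sum(incremental)


if __name__ == "__main__":
    # Two incremental rules matched, no forced score.
    print(zone(final_score([100, -30])))
    # An ignore rule (=-9999) overrides whatever else matched.
    print(zone(final_score([100, -30], absolute=-9999)))
```
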
Compared with slrn, pan lacks support for the include directive, as well as for nested/grouped rules. An additional difference is that pan's processing is case insensitive.

And one additional difference: presently, pan's scoring only supports logical OR, so if ANY of the conditions match, the scoring rule is applied. As documented, logical OR is Score:: (double colon), while Score: (single colon) /should/ be logical AND (only apply if ALL conditions match). Presently, however, Score: seems to behave like Score::, they both are treated as logical OR and ANY matching condition triggers the score. I'm not sure but I /believe/ this to be a bug, as I could almost swear that logical AND (single colon) *USED* to work. But someone posted that they couldn't get it working and I tested and sure enough, all my scores were being applied as logical OR as well, so either it broke somewhere along the line or I'm mis-remembering and it never worked in the first place.

Tho it can be noted that with pan's scoring zones and appropriate use of incremental scoring, /almost/ the same effect can be achieved by simply adjusting the score values of the various individual elements composing the would-be AND, such that selected posts only fall in the desired score-zone if ALL the appropriate conditions match; otherwise they'll fall into a different zone due to failure to match some of the conditions that incrementally combine to put them in the target zone.

Meanwhile, it's also worthwhile to specifically point out that as pan creates the scores, MOST of the lines pan actually writes into the scorefile are either blank lines or comments, due to the leading % comment indicator. You can thus trim all the %BOS and %EOS lines, etc, without affecting actual scoring functionality at all, as they're comments, there simply to clarify what pan was actually doing when it wrote the score, add date information, etc.

Also, note the overview-headers-only recommendation, which applies to pan as well as slrn.

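As a rough illustration of the incremental-scoring workaround for the missing AND described above, here's a small Python sketch of the arithmetic. The rule names and the 2500-point weights are hypothetical, chosen only so that a single match stays in the medium zone (1 to 4999) while both matches together reach the high zone (5000 and up). Nothing here is pan code.

```python
# Hypothetical weights for two conditions that should act like an AND.
# Each one alone stays in the medium zone; both together reach 5000.
RULES = {
    "subject_matches": 2500,
    "from_matches": 2500,
}

HIGH_ZONE_MIN = 5000  # lower bound of the "high" zone


def incremental_total(matched):
    """Sum the weights of the rules that matched a post."""
    return sum(RULES[name] for name in matched)


def reaches_high_zone(matched):
    """True only if the summed score lands in the high zone or above."""
    return incremental_total(matched) >= HIGH_ZONE_MIN


if __name__ == "__main__":
    for matched in ([], ["subject_matches"],
                    ["subject_matches", "from_matches"]):
        print(matched, incremental_total(matched), reaches_high_zone(matched))
```

This is also why it's only /almost/ an AND: any other incremental rule that happens to match the same post shifts the total too, so the weights need some margin to keep an unrelated match from pushing a one-condition post over the zone boundary (or pulling a two-condition post back under it).
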
Scoring can be done on any header, but if the header isn't in the overview file, it's likely the whole post will have to be downloaded before the score can be applied. While for ignored posts especially that can still be better than having to actually see and deal with the post manually, it does mean the post has to be downloaded to cache before the score can be applied, so if at all possible, it's MUCH better to score on overview-included headers only, thus avoiding the download entirely.

Finally, here's a link (via gmane) to an earlier post of mine, with an excerpt from my own scorefile (and some additional explanation/commentary) as an example of what a nicely organized hand edited scorefile can look like in practice.

http://permalink.gmane.org/gmane.comp.gnome.apps.pan.user/8689

OK, that covers scoring, but other than showing or hiding posts and/or making their scores show up in pretty colors in the header pane's score column, of what practical USE are they? In particular, how can they be used to trigger automated pre-download filtering and selective download or delete-before-download?

That's where actions come in! =:^)

2) Actions

Once you understand how scoring works and master the art of writing good scoring rules, pan's still relatively new (automated) actions feature makes putting those scoring rules to practical use actually quite simple. =:^)

Actions are configured in pan preferences on the actions tab, and are scoring-zone based. Depending on how much you want to rely on scoring to determine what's automatically processed, there are several suggested configurations possible.

I'd recommend NOT setting automated delete or even mark-read just yet, until you've watched how your scoring config is working and are comfortable that it's working as intended and you're not going to be missing a whole lot of posts due to accidental ignore-score matches.
In fact, here's what I recommend:

Before setting up actions at all, do this:

1) In pan prefs, headers tab: Ensure that you have the score column enabled, and order it so it's in view all the time. Back in the main window, header pane/tab, expand or shrink the score and other columns as necessary to fit, while keeping the score column in view.

2) In the view menu, header pane submenu: Ensure that ALL the "match scores of" options, including "low" and "ignored", are checked.

3) Back in pan prefs, on the colors tab under header pane: Set up your colors for each score zone so you can tell the zones apart just by color. In particular, make sure the ignored and low/negative score zone colors stand out, as well as watched and, if you eventually intend to auto-download them, the medium and high score zones.

4) In pan prefs, actions tab: Ensure that all actions are currently DISABLED, for testing mode.

Now go back to using pan, setting up scores to put posts in the desired score zones as appropriate, and watching that the scores work as intended. In particular, be sure the low/negative and ignored zones aren't catching posts that you actually want to see.

5) After some time watching that posts are getting assigned to their intended score zones, when you're comfortable that they are...

6) Back in pan prefs, on the actions tab: Enable actions for score zones as appropriate. Here's a recommended example:

Delete articles scoring at: -9999 or less (ignored)

This is optional. If you're conservative, you might wish to keep this disabled instead.

Mark articles read scoring at: -9998 to -1 (low)

Alternatively, if you're not deleting ignored articles, you can set it to simply mark-read ignored.
Assuming you have pan set to hide read posts, marking them read will hide them, but the headers won't actually be deleted, so you can still set show read posts temporarily if you want to refer back to them, perhaps because someone (not scored so low, so you see the post) referred to them in a quote and you want to read the entire post to get the context.

Note that the above "negative actions" should work at the set level AND BELOW. So if you for instance set mark-read from low-zone, it should mark-read ignored-zone as well.

The below "positive actions" should work the other way, at the set level AND ABOVE. So if you set cache articles scoring medium-zone, it should also cache those in the high and watched zones.

Depending on whether you run pan in binary download-and-save-attachments or cache-and-process-later modes, and whether you want to auto-cache/download medium and high scorezone posts or only watched posts, you can set these as appropriate, but:

Read-text-mode example:

Cache articles scoring at: 5000 to 9998 (high)

Download-binary-mode example:

Download attachments of articles scoring at: 9999 or more (watched)

For the download example, after you're sure it's downloading as appropriate, you likely want to set the mark affected articles read option as well, since once the attachments are downloaded (and saved) you probably don't care to see them any longer.

I believe once you have both scoring and actions set up appropriately, you'll find it does what you need quite well. Pre-processing filters? Why? That would only slow pan down! =:^)

Tho since you posted to the dev list there's a reasonable chance that you can code in the C++ pan's written in, and if so, I'm sure no one would object if you found time to figure out why pan's AND scoring doesn't work and fix that, and if we're /really/ lucky, we might even get a patch for the missing include and/or nested/grouped scoring condition support, as well. Then pan's scoring support would /really/ rock!
=:^)

Several devs have in the past cloned pan's git repo to their github accounts and requested pulls from there when they have patches ready. =:^)

---
[1] My text-instance: I run several separate pan instances each with its own config and cache, one for text, one for binaries, and one for temporary testing. Pan reads the PAN_HOME environmental var and uses that for its config and data if set, falling back to the default otherwise (~/.pan2 on *ix anyway, I don't do windows). I use that in a wrapper script to point pan at the appropriate config depending on whether I launched the bin, text or test wrapper.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

_______________________________________________
Pan-devel mailing list
Pan-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/pan-devel