Frederick Eaton <frede...@ofb.net> writes:

>
> Suppose the filter script reads a message from a particular file and decides
> that it is spam. How does the filter tell Notmuch that the message
> corresponding to that file is spam? You seem to be saying below that the
> filter script should extract the Message-ID and use it to identify the
> message to Notmuch, since file paths of the messages are not indexed.
> Probably what my script should be doing for each message is appending a line
> to a batch file like this:
>
>      +spam -new -- id:some_message_id@foo
>      +inbox -new -- id:some_other@baz
>
> and then passing the batch file to "notmuch tag"?
>

Hello Frederick, you are exactly correct. This is what I've written to handle
spam filtering in my notmuch post-new hook. Like you, I have notmuch configured
to tag newly fetched mail with "new":

notmuch search --output=messages 'tag:new' > /tmp/msgs
notmuch search --output=files 'tag:new' |\
    bogofilter -o0.7,0.7 -bt |\
    paste - /tmp/msgs |\
    awk '$1 ~ /S/ { print "-new +spam", "--", $3 }' |\
    notmuch tag --batch

This should run under any shell. My chosen filter is bogofilter. The -bt flags
tell it to operate on a stdin "batch" of file paths and return a "terse"
summary of results, e.g.

H 0.248913
S 0.999999
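
To make the pairing step easier to follow, here is a minimal sketch of it as a
standalone function. The name spam_batch_lines is made up for illustration, and
it assumes the terse output has exactly the two-column form shown above, with
one verdict line per message id.

```shell
# Hypothetical helper (not part of notmuch or bogofilter).
# stdin: bogofilter terse verdicts, one per line, e.g. "S 0.999999"
# $1:    file with one "id:..." query per line, in the same order
# Prints one "notmuch tag --batch" line for every message judged spam.
spam_batch_lines() {
    ids_file=$1
    # paste joins line N of stdin with line N of the id file (tab separated),
    # so awk sees: $1 = verdict, $2 = score, $3 = id: query.
    paste - "$ids_file" |
        awk '$1 == "S" { print "+spam -new -- " $3 }'
}
```

The output can be piped straight into "notmuch tag --batch". Message ids that
contain whitespace would break the awk field split; this sketch does not handle
them.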

This script assumes that notmuch returns results in the same order for the same
query, whichever --output format is requested, which is fortunately true.
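
One case where the two lists could still drift apart is a message that exists
as several files (duplicates), because --output=files may then print more lines
than --output=messages. A cheap guard is to compare line counts before pasting.
The sketch below is only an illustration: same_line_count is a hypothetical
name, and the commented usage assumes your notmuch version supports the
--duplicate option of "notmuch search".

```shell
# Hypothetical guard: succeed only if both files have the same number of lines.
same_line_count() {
    a=$(wc -l < "$1") && b=$(wc -l < "$2") || return 2
    # $((...)) strips any padding some wc implementations add.
    [ "$((a))" -eq "$((b))" ]
}

# Possible use in a post-new hook (assumption: --duplicate=1 is available and
# limits the output to one file per message):
#
#   notmuch search --output=messages 'tag:new' > "$ids"
#   notmuch search --output=files --duplicate=1 'tag:new' > "$files"
#   same_line_count "$ids" "$files" || { echo "lists differ" >&2; exit 1; }
```

If the counts differ, it is safer to skip tagging for that run than to tag the
wrong messages.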

>>>I've tentatively concluded that the best way to locate each message in the
>>>Notmuch database is to extract the Message-ID and search for it with "id:"?
>>>But the FAQ says that multiple messages can have the same Message-ID (and
>>>some spam messages don't have one at all).

Your instinct to use batch tagging and id: queries is correct. I collect my new
message ids in /tmp/msgs. These ids are definitely unique enough to be used to
tag individual messages on a daily basis. If you prefer to tag entire threads
as spam the moment a single message is spam, you can simply use

notmuch search --output=threads 'tag:new' > /tmp/msgs

I prefer to manually mute threads with a mute tag, but thread ids are
definitely unique.

If you want to auto-tag spam in an existing archive, then you will first need
to manually tag a good quantity of messages (100-1000) you consider to be spam
and a good quantity of messages (100-1000) you consider to be ham, and use them
to train the filter, e.g.

notmuch search --output=files 'tag:spam' | bogofilter -bs
notmuch search --output=files 'tag:inbox' | bogofilter -bn

>>>If I could access the message using the filename that the script is
>>>processing, it would seem slightly more reliable. It seems like there should
>>>be some way to allow a Notmuch database entry to be accessed directly by
>>>filename, without even creating a Notmuch-style search query containing that
>>>filename, but rather by passing the filename as a command-line argument to
>>>"notmuch". It would be nice not to have to worry about quoting and
>>>unquoting.
>>
>>I am not sure if this is useful, given that (presumably) Notmuch uses message
>>IDs as keys. Besides, those filenames are usually generated automatically and
>>quite cryptic.
>
> It might be useful for the reasons I stated, namely in case the Message-ID
> does not exist or is not unique.

I think mail that is successfully transmitted through a mail host necessarily
obtains a message id, but I might be wrong. I believe notmuch indexes on both
its own unique thread ids and the message ids, which further decreases the
already minuscule chance of message id collisions.

--
Best,
Panos
_______________________________________________
notmuch mailing list -- notmuch@notmuchmail.org
To unsubscribe send an email to notmuch-le...@notmuchmail.org