Update of /cvsroot/spambayes/spambayes/scripts
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv11624/scripts

Modified Files:
        sb_notesfilter.py 
Log Message:
Add improvements from Hugo Duncan:

 Add ability to log to a Notes log.
 Add ability to specify the name of the index file.
 Improve message headers.
 Add ability to specify password
 Save spam field
 Connect to database remotely, if possible

Plus some misc cleanup.

Index: sb_notesfilter.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/scripts/sb_notesfilter.py,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** sb_notesfilter.py   3 May 2004 02:08:13 -0000       1.8
--- sb_notesfilter.py   3 Dec 2004 05:09:42 -0000       1.9
***************
*** 1,11 ****
  #! /usr/bin/env python
  
! '''sb_notesfilter.py - Lotus Notes Spambayes interface.
! 
! Classes:
! 
! Abstract:
  
!     This module uses Spambayes as a filter against a Lotus Notes mail
      database.  The Notes client must be running when this process is
      executed.
--- 1,7 ----
  #! /usr/bin/env python
  
! '''sb_notesfilter.py - Lotus Notes SpamBayes interface.
  
!     This module uses SpamBayes as a filter against a Lotus Notes mail
      database.  The Notes client must be running when this process is
      executed.
***************
*** 45,49 ****
      folder for Ham that has mistakenly been classified as Spam.  If
      there is any there, move it to the Train as Ham folder, so
!     Spambayes will be less likely to make this mistake again.
  
      Mail that is classified as Ham or Unsure is left in the inbox.
--- 41,45 ----
      folder for Ham that has mistakenly been classified as Spam.  If
      there is any there, move it to the Train as Ham folder, so
!     SpamBayes will be less likely to make this mistake again.
  
      Mail that is classified as Ham or Unsure is left in the inbox.
***************
*** 53,57 ****
      You should occasionally select some Ham and move it to the Train
      as Ham folder, so Spambayes can tell the difference between Spam
!     and Ham. The goal is to maintain a relative balance between the
      number of Spam and the number of Ham that have been trained into
      the database. These numbers are reported every time this program
--- 49,53 ----
      You should occasionally select some Ham and move it to the Train
      as Ham folder, so Spambayes can tell the difference between Spam
!     and Ham. The goal is to maintain an approximate balance between the
      number of Spam and the number of Ham that have been trained into
      the database. These numbers are reported every time this program
***************
*** 59,67 ****
      the amount of Ham you receive, it may be very difficult to
      maintain this balance.  This is not a matter of great concern.
!     Spambayes will still make very few mistakes in this circumstance.
      But, if this is the case, you should review your Spam folder for
      falsely classified Ham, and retrain those that you find, on a
      regular basis.  This will prevent statistical error accumulation,
!     which if allowed to continue, would cause Spambayes to tend to
      classify everything as Spam.
  
--- 55,63 ----
      the amount of Ham you receive, it may be very difficult to
      maintain this balance.  This is not a matter of great concern.
!     SpamBayes will still make very few mistakes in this circumstance.
      But, if this is the case, you should review your Spam folder for
      falsely classified Ham, and retrain those that you find, on a
      regular basis.  This will prevent statistical error accumulation,
!     which if allowed to continue, would cause SpamBayes to tend to
      classify everything as Spam.
  
***************
*** 70,74 ****
      it keeps a pickled dictionary of notes mail ids, so that once a
      mail has been classified, it will not be classified again.  The
!     non-existence of is index file, named <local database>.sbindex,
      indicates to the system that this is an initialization execution.
      Rather than classify the inbox in this case, the contents of the
--- 66,70 ----
      it keeps a pickled dictionary of notes mail ids, so that once a
      mail has been classified, it will not be classified again.  The
!     non-existence of this index file, named <local database>.sbindex,
      indicates to the system that this is an initialization execution.
      Rather than classify the inbox in this case, the contents of the
***************
*** 91,95 ****
                              e.g. d27ml602/27/M/IBM
                            if specified, will initiate a replication
!             -f  folder  : Name of spambayes folder
                              must have subfolders: Spam
                                                    Ham
--- 87,91 ----
                              e.g. d27ml602/27/M/IBM
                            if specified, will initiate a replication
!             -f  folder  : Name of SpamBayes folder
                              must have subfolders: Spam
                                                    Ham
***************
*** 103,106 ****
--- 99,105 ----
                            statistics output would otherwise be lost when the
                            window closes.
+             -i filename : index file name
+             -W          : password
+             -L dbname   : log to database (template alog4.ntf)
              -o section:option:value :
                            set [section, option] in the options database
***************
*** 124,131 ****
      o sb_server style training/configuration interface?
      o parameter to retrain?
      o Suggestions?
  '''
  
! # This module is part of the spambayes project, which is Copyright 2002
  # The Python Software Foundation and is covered by the Python Software
  # Foundation license.
--- 123,131 ----
      o sb_server style training/configuration interface?
      o parameter to retrain?
+     o Use spambayes.message MessageInfo db's rather than own database.
      o Suggestions?
  '''
  
! # This module is part of the spambayes project, which is Copyright 2002-5
  # The Python Software Foundation and is covered by the Python Software
  # Foundation license.
***************
*** 154,158 ****
  
  
! def classifyInbox(v, vmoveto, bayes, ldbname, notesindex):
  
      # the notesindex hash ensures that a message is looked at only once
--- 154,158 ----
  
  
! def classifyInbox(v, vmoveto, bayes, ldbname, notesindex, log):
  
      # the notesindex hash ensures that a message is looked at only once
***************
*** 176,200 ****
          else:
              if not notesindex.has_key(nid):
- 
                  numdocs += 1
  
                  # Notes returns strings in unicode, and the Python
!                 # uni-decoder has trouble with these strings when
                  # you try to print them.  So don't...
  
!                 # The com interface returns basic data types as tuples
!                 # only, thus the subscript on GetItemValue
! 
!                 try:
!                     subj = doc.GetItemValue('Subject')[0]
!                 except:
!                     subj = 'No Subject'
! 
!                 try:
!                     body  = doc.GetItemValue('Body')[0]
!                 except:
!                     body = 'No Body'
! 
!                 message = "Subject: %s\r\n\r\n%s" % (subj, body)
  
                  # generate_long_skips = True blows up on occasion,
--- 176,186 ----
          else:
              if not notesindex.has_key(nid):
                  numdocs += 1
  
                  # Notes returns strings in unicode, and the Python
!                 # decoder has trouble with these strings when
                  # you try to print them.  So don't...
  
!                 message = getMessage(doc)
  
                  # generate_long_skips = True blows up on occasion,
***************
*** 218,261 ****
                  try:
                      print "%s spamprob is %s" % (subj[:30], prob)
                  except UnicodeError:
                      print "<subject not printed> spamprob is %s" % (prob)
  
          doc = v.GetNextDocument(doc)
  
      # docstomove list is built because moving documents in the middle of
!     # the classification loop looses the iterator position
      for doc in docstomove:
          doc.RemoveFromFolder(v.Name)
          doc.PutInFolder(vmoveto.Name)
  
!     print "%s documents processed" % (numdocs)
!     print "   %s classified as spam" % (numspam)
!     print "   %s classified as ham" % (numham)
!     print "   %s classified as unsure" % (numuns)
  
  
! def processAndTrain(v, vmoveto, bayes, is_spam, notesindex):
  
      if is_spam:
!         str = options["Headers", "header_spam_string"]
      else:
!         str = options["Headers", "header_ham_string"]
  
!     print "Training %s" % (str)
  
      docstomove = []
      doc = v.GetFirstDocument()
      while doc:
!         try:
!             subj = doc.GetItemValue('Subject')[0]
!         except:
!             subj = 'No Subject'
! 
!         try:
!             body  = doc.GetItemValue('Body')[0]
!         except:
!             body = 'No Body'
! 
!         message = "Subject: %s\r\n%s" % (subj, body)
  
          options["Tokenizer", "generate_long_skips"] = False
--- 204,273 ----
                  try:
                      print "%s spamprob is %s" % (subj[:30], prob)
+                     if log:
+                         log.LogAction("%s spamprob is %s" % (subj[:30],
+                                                              prob))
                  except UnicodeError:
                      print "<subject not printed> spamprob is %s" % (prob)
+                     if log:
+                         log.LogAction("<subject not printed> spamprob " \
+                                       "is %s" % (prob,))
+ 
+                 item = doc.ReplaceItemValue("Spam", prob)
+                 item.IsSummary = True
+                 doc.save(False, True, False)
  
          doc = v.GetNextDocument(doc)
  
      # docstomove list is built because moving documents in the middle of
!     # the classification loop loses the iterator position
      for doc in docstomove:
          doc.RemoveFromFolder(v.Name)
          doc.PutInFolder(vmoveto.Name)
  
!     print "%s documents processed" % (numdocs,)
!     print "   %s classified as spam" % (numspam,)
!     print "   %s classified as ham" % (numham,)
!     print "   %s classified as unsure" % (numuns,)
!     if log:
!         log.LogAction("%s documents processed" % (numdocs,))
!         log.LogAction("   %s classified as spam" % (numspam,))
!         log.LogAction("   %s classified as ham" % (numham,))
!         log.LogAction("   %s classified as unsure" % (numuns,))
  
+ def getMessage(doc):
+     try:
+         subj = doc.GetItemValue('Subject')[0]
+     except:
+         subj = 'No Subject'
  
!     try:
!         body  = doc.GetItemValue('Body')[0]
!     except:
!         body = 'No Body'
! 
!     hdrs = ''
!     for item in doc.Items:
!         if item.Name == "From" or item.Name == "Sender" or \
!            item.Name == "Received" or item.Name == "ReplyTo":
!             try:
!                 hdrs = hdrs + ( "%s: %s\r\n" % (item.Name, item.Text) )
!             except:
!                 hdrs = ''
! 
!     message = "%sSubject: %s\r\n\r\n%s" % (hdrs, subj, body)
!     return message
  
+ def processAndTrain(v, vmoveto, bayes, is_spam, notesindex, log):
      if is_spam:
!         header_str = options["Headers", "header_spam_string"]
      else:
!         header_str = options["Headers", "header_ham_string"]
  
!     print "Training %s" % (header_str,)
  
      docstomove = []
      doc = v.GetFirstDocument()
      while doc:
!         message = getMessage(doc)
  
          options["Tokenizer", "generate_long_skips"] = False
***************
*** 276,280 ****
          bayes.learn(tokens, is_spam)
  
!         notesindex[nid] = str
          docstomove += [doc]
          doc = v.GetNextDocument(doc)
--- 288,292 ----
          bayes.learn(tokens, is_spam)
  
!         notesindex[nid] = header_str
          docstomove += [doc]
          doc = v.GetNextDocument(doc)
***************
*** 284,300 ****
          doc.PutInFolder(vmoveto.Name)
  
!     print "%s documents trained" % (len(docstomove))
  
  
! def run(bdbname, useDBM, ldbname, rdbname, foldname, doTrain, doClassify):
      bayes = storage.open_storage(bdbname, useDBM)
  
      try:
!         fp = open("%s.sbindex" % (ldbname), 'rb')
      except IOError, e:
!         if e.errno != errno.ENOENT: raise
          notesindex = {}
!         print "%s.sbindex file not found, this is a first time run" \
!               % (ldbname)
          print "No classification will be performed"
      else:
--- 296,315 ----
          doc.PutInFolder(vmoveto.Name)
  
!     print "%s documents trained" % (len(docstomove),)
!     if log:
!         log.LogAction("%s documents trained" % (len(docstomove),))
  
  
! def run(bdbname, useDBM, ldbname, rdbname, foldname, doTrain, doClassify,
!         pwd, idxname, logname):
      bayes = storage.open_storage(bdbname, useDBM)
  
      try:
!         fp = open(idxname, 'rb')
      except IOError, e:
!         if e.errno != errno.ENOENT:
!             raise
          notesindex = {}
!         print "%s file not found, this is a first time run" % (idxname,)
          print "No classification will be performed"
      else:
***************
*** 302,347 ****
          fp.close()
  
      sess = win32com.client.Dispatch("Lotus.NotesSession")
      try:
!         sess.initialize()
      except pywintypes.com_error:
          print "Session aborted"
          sys.exit()
  
!     db = sess.GetDatabase("",ldbname)
  
      vinbox = db.getView('($Inbox)')
!     vspam = db.getView("%s\Spam" % (foldname))
!     vham = db.getView("%s\Ham" % (foldname))
!     vtrainspam = db.getView("%s\Train as Spam" % (foldname))
!     vtrainham = db.getView("%s\Train as Ham" % (foldname))
  
      if doTrain:
!         processAndTrain(vtrainspam, vspam, bayes, True, notesindex)
          # for some reason, using inbox as a target here loses the mail
!         processAndTrain(vtrainham, vham, bayes, False, notesindex)
  
!     if rdbname:
!         print "Replicating..."
!         db.Replicate(rdbname)
!         print "Done"
  
      if doClassify:
!         classifyInbox(vinbox, vtrainspam, bayes, ldbname, notesindex)
  
      print "The Spambayes database currently has %s Spam and %s Ham" \
!         % (bayes.nspam, bayes.nham)
  
      bayes.store()
  
!     fp = open("%s.sbindex" % (ldbname), 'wb')
      pickle.dump(notesindex, fp)
      fp.close()
  
  
- if __name__ == '__main__':
  
      try:
!         opts, args = getopt.getopt(sys.argv[1:], 'htcPd:p:l:r:f:o:')
      except getopt.error, msg:
          print >>sys.stderr, str(msg) + '\n\n' + __doc__
--- 317,393 ----
          fp.close()
  
+     need_replicate = False
+ 
      sess = win32com.client.Dispatch("Lotus.NotesSession")
      try:
!         if pwd:
!             sess.initialize(pwd)
!         else:
!             sess.initialize()
      except pywintypes.com_error:
          print "Session aborted"
          sys.exit()
+     try:
+         db = sess.GetDatabase(rdbname, ldbname)
+     except pywintypes.com_error:
+         if rdbname:
+             print "Could not open database remotely, trying locally"
+             try:
+                 db = sess.GetDatabase("", ldbname)
+                 need_replicate = True
+             except pywintypes.com_error:
+                 print "Could not open database"
+                 sys.exit()
+         else:
+             raise
  
!     log = sess.CreateLog("SpambayesAgentLog")
!     try:
!         log.OpenNotesLog("", logname)
!     except pywintypes.com_error:
!         print "Could not open log"
!         log = None
! 
!     if log:
!         log.LogAction("Running spambayes")
  
      vinbox = db.getView('($Inbox)')
!     vspam = db.getView("%s\Spam" % (foldname,))
!     vham = db.getView("%s\Ham" % (foldname,))
!     vtrainspam = db.getView("%s\Train as Spam" % (foldname,))
!     vtrainham = db.getView("%s\Train as Ham" % (foldname,))
  
      if doTrain:
!         processAndTrain(vtrainspam, vspam, bayes, True, notesindex, log)
          # for some reason, using inbox as a target here loses the mail
!         processAndTrain(vtrainham, vham, bayes, False, notesindex, log)
  
!     if need_replicate:
!         try:
!             print "Replicating..."
!             db.Replicate(rdbname)
!             print "Done"
!         except pywintypes.com_error:
!             print "Could not replicate"
  
      if doClassify:
!         classifyInbox(vinbox, vtrainspam, bayes, ldbname, notesindex, log)
  
      print "The Spambayes database currently has %s Spam and %s Ham" \
!           % (bayes.nspam, bayes.nham)
  
      bayes.store()
  
!     fp = open(idxname), 'wb')
      pickle.dump(notesindex, fp)
      fp.close()
  
+     if log:
+         log.LogAction("Finished running spambayes")
  
  
+ if __name__ == '__main__':
      try:
!         opts, args = getopt.getopt(sys.argv[1:], 'htcPd:p:l:r:f:o:i:W:L:')
      except getopt.error, msg:
          print >>sys.stderr, str(msg) + '\n\n' + __doc__
***************
*** 351,354 ****
--- 397,403 ----
      rdbname = None  # remote notes database location
      sbfname = None  # spambayes folder name
+     idxname = None  # index file name
+     logname = None  # log database name
+     pwd = None # password
      doTrain = False
      doClassify = False
***************
*** 371,381 ****
          elif opt == '-P':
              doPrompt = True
          elif opt == '-o':
              options.set_from_cmdline(arg, sys.stderr)
      bdbname, useDBM = storage.database_type(opts)
  
      if (bdbname and ldbname and sbfname and (doTrain or doClassify)):
          run(bdbname, useDBM, ldbname, rdbname, \
!             sbfname, doTrain, doClassify)
  
          if doPrompt:
--- 420,439 ----
          elif opt == '-P':
              doPrompt = True
+         elif opt == '-i':
+             idxname = arg
+         elif opt == '-L':
+             logname = arg
+         elif opt == '-W':
+             pwd = arg
          elif opt == '-o':
              options.set_from_cmdline(arg, sys.stderr)
      bdbname, useDBM = storage.database_type(opts)
  
+     if not idxname:
+         idxname = "%s.sbindex" % (ldbname)
+ 
      if (bdbname and ldbname and sbfname and (doTrain or doClassify)):
          run(bdbname, useDBM, ldbname, rdbname, \
!             sbfname, doTrain, doClassify, pwd, idxname, logname)
  
          if doPrompt:

_______________________________________________
Spambayes-checkins mailing list
[EMAIL PROTECTED]
http://mail.python.org/mailman/listinfo/spambayes-checkins

Reply via email to