Patches item #1532856, was opened at 2006-08-01 21:02
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=1532856&group_id=61702
Please note that this message will contain a full copy of the comment
thread, including the initial issue submission, for this request,
not just the latest update.

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Skip Montanaro (montanaro)
Assigned to: Nobody/Anonymous (nobody)
Summary: Compute size of embedded images

Initial Comment:
Attached is a tokenizer patch that generates int(log2(size)) tokens for
embedded images. It seems clear we have to do more about image-based
spam. This seems like a cheap trick, and at least for my current corpus
it generates a fair number of spammy clues:

token,nspam,nham,spam prob
image-size:2**6,4,1,0.5
image-size:2**7,4,1,0.5
image-size:2**5,1,0,0.844827586207
image-size:2**8,6,0,0.96511627907
image-size:2**9,3,0,0.934782608696
image-size:2**10,7,1,0.620791675168
image-size:2**11,9,0,0.97619047619
image-size:2**12,13,0,0.983271375465
image-size:2**13,14,0,0.984429065744
image-size:2**14,53,0,0.995790458372
image-size:2**15,19,1,0.813543282782

Of course, it may not improve discrimination when tested more
rigorously, but it might be worth a try. I haven't done any NxN
testing. I no longer have more training messages lying about than is
necessary for my day-to-day needs.

Note that the patch will apply to current sources with an offset or
two. I have a couple of other mods in my current source code that I
excised from the diffs.

----------------------------------------------------------------------

_______________________________________________
Spambayes-bugs mailing list
Spambayes-bugs@python.org
http://mail.python.org/mailman/listinfo/spambayes-bugs
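
For readers without the attachment, here is a rough sketch of the idea in the initial comment: emit one "image-size:2**N" token per embedded image, where N is int(log2(size)). This is not the attached patch. The function names image_size_tokens and build_demo_message are made up for illustration, "size" is assumed to mean the decoded payload length in bytes, and the code uses the standard library email package rather than the SpamBayes tokenizer itself.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def image_size_tokens(msg):
    """Yield an 'image-size:2**N' token for each image/* part of msg.

    Assumption: 'size' is the decoded payload length in bytes.
    """
    for part in msg.walk():
        if part.get_content_maintype() != "image":
            continue
        payload = part.get_payload(decode=True)  # undo base64 etc.
        if not payload:
            continue  # empty or undecodable part: no token
        # For a positive integer n, n.bit_length() - 1 equals
        # int(log2(n)) without floating-point rounding.
        exponent = len(payload).bit_length() - 1
        yield "image-size:2**%d" % exponent


def build_demo_message(image_bytes):
    """Hypothetical helper: a multipart message with one embedded image."""
    msg = MIMEMultipart()
    msg.attach(MIMEText("see attached"))
    # The subtype is given explicitly so no image-type sniffing is needed.
    msg.attach(MIMEImage(image_bytes, _subtype="png"))
    return msg


if __name__ == "__main__":
    demo = build_demo_message(b"\x00" * 5000)
    print(list(image_size_tokens(demo)))
```

Bucketing by powers of two keeps the number of distinct tokens small, so each bucket can accumulate enough spam and ham counts to be useful as a clue.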