[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-18 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #39 from Mark Martinec 2010-01-18 10:06:49 UTC --- (In reply to comment #32) > I would imagine that treating the multi-byte characters as individual bytes > might bite us in ways similar to Bug 6183. (In reply to comment #3

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-18 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #38 from Justin Mason 2010-01-18 09:32:06 UTC --- regarding Bayes tokenization: use of byte-level breaks there is an explicit choice, not a bug. -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?ta

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-18 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #37 from Warren Togami 2010-01-18 09:16:42 UTC --- (In reply to comment #35) > Actually, if you look at Bayes.pm you'll notice that it already has use bytes > so I'm not sure it will have any effect on bayes tokenization. O

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-18 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #36 from Henrik Krohns 2010-01-18 08:56:35 UTC --- (In reply to comment #35) > Actually, if you look at Bayes.pm you'll notice that it already has use bytes > so I'm not sure it will have any effect on bayes tokenization. If

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-18 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #33 from Warren Togami 2010-01-18 08:43:20 UTC --- (In reply to comment #31) > Does everyone agree that this is only isolated to small subsets of messages, > rather than affecting scan speed for all, or a majority of, messag

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-18 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #35 from Michael Parker 2010-01-18 08:46:16 UTC --- Actually, if you look at Bayes.pm you'll notice that it already has use bytes so I'm not sure it will have any effect on bayes tokenization.

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-18 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #34 from Henrik Krohns 2010-01-18 08:44:31 UTC --- On behalf of my corpus, +1 for releasing 3.3.0 also. -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this ma

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-18 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #32 from Warren Togami 2010-01-18 08:26:41 UTC --- (In reply to comment #28) > Not surprisingly it affects Bayes, but only as slightly as the rules. Probably > tokens containing highbits etc. It's simple to test with sa-lear

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-18 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #31 from Justin Mason 2010-01-18 08:23:27 UTC --- Does everyone agree that this is only isolated to small subsets of messages, rather than affecting scan speed for all, or a majority of, messages? ("mass-check" is the approp

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-18 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #30 from Justin Mason 2010-01-18 05:57:29 UTC --- (In reply to comment #23) > so that's a barely-noticeable difference of 4 mails (out of 5000 hams) more > __HIGHBITS hits, and 2 less __TVD_SPACE_RATIO hits in spam. no scori

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-18 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #29 from Mark Martinec 2010-01-18 04:06:45 UTC --- (In reply to comment #24) > What is the exact patch to test "use bytes"? My corpus has lots of Japanese > mail and I could run masscheck. The exact location is not critica

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-17 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #28 from Henrik Krohns 2010-01-17 23:29:23 UTC --- Not surprisingly it affects Bayes, but only as slightly as the rules. Probably tokens containing highbits etc. It's simple to test with sa-learn and comparing dumps.

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-17 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #27 from Warren Togami 2010-01-17 23:18:18 UTC --- Even if the rule hits are roughly equivalent: * Is this functionally equivalent for the tokens going in/out of Bayes? * What about those reasons why this was removed years

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-17 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #26 from Henrik Krohns 2010-01-17 21:43:03 UTC --- Looking at mass runtimes it's probably worth exploring only after 3.3.0. I'll be enabling use bytes on my slow server though, since I get many bad messages and the effect is

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-17 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #25 from Michael Parker 2010-01-17 20:31:03 UTC --- Note that I've seen differences between 18 and 28 percent on a mass-check of around 50k spam and ham. I was seeing similar differences in my freqdiff, HIGHBITS was the hig

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-17 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #24 from Warren Togami 2010-01-17 16:58:35 UTC --- What is the exact patch to test "use bytes"? My corpus has lots of Japanese mail and I could run masscheck. But given 48 minutes vs 50 minutes is this really worthwhile of

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-17 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #23 from Justin Mason 2010-01-17 16:24:37 UTC --- (In reply to comment #22) > (In reply to comment #21) > > btw, I think this significant slowdown in 3.3.0 may be as a result of > > increased use of replace_rules rules, comp

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-16 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 John Hardin changed: CC: added jhar...@impsec.org --- Comment #2

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-16 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #21 from Justin Mason 2010-01-16 15:58:59 UTC --- btw, I think this significant slowdown in 3.3.0 may be as a result of increased use of replace_rules rules, compared to when bug 4596 happened.

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-16 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #20 from Henrik Krohns 2010-01-16 06:44:50 UTC --- Surprisingly, total scan time was only 10 min vs 13 min in favor of use bytes.

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-16 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #19 from Henrik Krohns 2010-01-16 06:38:51 UTC --- Here's a quick run of 10k ham + 10k spam on perl 5.10.0. $ ./freqdiff -c ham.log.nobytes ham.log.bytes 518 __HIGHBITS 64 T_HK_MUCHMONEY 52 __hk_million 39

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-15 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #18 from Michael Parker 2010-01-15 23:33:37 UTC --- Some history: http://svn.apache.org/viewvc?view=revision&revision=315047 and Bug 4596

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-15 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #17 from Warren Togami 2010-01-15 22:24:09 UTC --- There is no doubt that this makes it faster, but I don't see any discussion here verifying that it results in correct behavior. I personally am OK with further delay of 3.3

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-15 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 Daryl C. W. O'Shea changed: CC: added spamassas...@dostech.ca --

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-15 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 Henrik Krohns changed: CC: added h...@hege.li --- Comment #15 fr

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-15 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 Mark Martinec changed: What|Removed |Added Platform|Sun |All OS/Version|Solaris

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-15 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #14 from Mark Martinec 2010-01-15 16:25:14 UTC --- Created an attachment (id=4644) --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4644) my sample message, gzipped Here attached is my newsletter sample as use

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-15 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #13 from Warren Togami 2010-01-15 13:29:03 UTC --- The attached Test case takes ~2 seconds with Fedora 12 perl-5.10.0 here, and ~12 seconds with RHEL-5 perl-5.8.8. Could someone please attach an example mail that is an even

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-14 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #12 from Mark Martinec 2010-01-14 17:56:39 UTC --- I repeated the same test with 'sa-compile'-d rules, this time with disabled HitFreqsRuleTiming, disabled debugging and disabled bayes. The shown tests_pri_0 time corresponds

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-14 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #11 from Mark Martinec 2010-01-14 17:16:50 UTC --- Took a 176 KB message that showed slow processing in the log, and ran it with and without 'use bytes' in Message.pm. Network tests disabled, some SARE rules, HitFreqsRuleTim

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-14 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Comment #10 from Michael Parker 2010-01-14 07:32:47 UTC --- Can someone do a freqdiff between two runs, especially with mail that has highbit characters?

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-14 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 Justin Mason changed: What|Removed |Added Priority|P5 |P2 Severity|normal

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-14 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 Justin Mason changed: CC: added j...@jmason.org --- Comment #8 f

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-05 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 Matt Selsky changed: CC: added sel...@columbia.edu

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2010-01-03 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 Mark Martinec changed: Target Milestone: changed from Undefined to 3.3.0

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2009-12-30 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 Larry Rosenbaum changed: CC: added rosenbau...@ornl.gov --- Comm

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2008-01-15 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Additional Comments From [EMAIL PROTECTED] 2008-01-15 12:22 --- This problem still exists in v3.2.4. --- You are receiving this mail because: --- You are the assignee for the bug, or are watching the assignee.

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2007-08-11 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Additional Comments From [EMAIL PROTECTED] 2007-08-11 20:18 --- fwiw, I would just change your LANG environment variable.

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2007-08-11 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Additional Comments From [EMAIL PROTECTED] 2007-08-11 19:41 --- Side note: Mark Martinec suggested this might be due to UTF-8 encoding in the locale. While my test system did have en_US.UTF-8 as the LANG, resetting /etc/

Re: [Bug 5590] Scantime is very long unless "use bytes" hack is used

2007-08-11 Thread Matt Kettler
Matt Kettler wrote: > Mark Martinec wrote: > >>> However, in mine the difference when using a "stock" 3.2.3 is barely >>> noticeable, going from 9 seconds to 8 seconds. >>> >>> Adding in a good handful of SARE rules (1365 extra rules, counting "score" >>> lines) makes the difference quite signif

Re: [Bug 5590] Scantime is very long unless "use bytes" hack is used

2007-08-11 Thread Loren Wilton
Good catch, mine is UTF-8. Not sure about the original reporter. Probably not, or they wouldn't be seeing the large difference they see. Loren

Re: [Bug 5590] Scantime is very long unless "use bytes" hack is used

2007-08-11 Thread Matt Kettler
Mark Martinec wrote: >> However, in mine the difference when using a "stock" 3.2.3 is barely >> noticeable, going from 9 seconds to 8 seconds. >> >> Adding in a good handful of SARE rules (1365 extra rules, counting "score" >> lines) makes the difference quite significant. >> >> Without "use bytes"

Re: [Bug 5590] Scantime is very long unless "use bytes" hack is used

2007-08-11 Thread Mark Martinec
> However, in mine the difference when using a "stock" 3.2.3 is barely > noticeable, going from 9 seconds to 8 seconds. > > Adding in a good handful of SARE rules (1365 extra rules, counting "score" > lines) makes the difference quite significant. > > Without "use bytes" and the SARE rules: > real

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2007-08-10 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Additional Comments From [EMAIL PROTECTED] 2007-08-10 19:06 --- Confirmed my test box can replicate the results using a crude: time spamassassin -t

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2007-08-10 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Additional Comments From [EMAIL PROTECTED] 2007-08-10 11:04 --- Created an attachment (id=4083) --> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=4083&action=view) spamd debug log from test case

[Bug 5590] Scantime is very long unless "use bytes" hack is used

2007-08-10 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5590 --- Additional Comments From [EMAIL PROTECTED] 2007-08-10 11:02 --- Created an attachment (id=4082) --> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=4082&action=view) Test case