On Wed, Dec 11, 2019 at 10:53:04AM +0100, Matus UHLAR - fantomas wrote:
> On 11.12.19 11:43, Henrik K wrote:
> >Wow 6 million tokens.. :-)
> >
> >I assume the big uuencoded blob content-type is text/* since it's tokenized?
> 
> yes, I mentioned that in previous mails. It's a ~15M file, uuencoded in a ~20M mail.
> 
> grep -c '^M' spamassassin-memory-error-<...>
> 329312
> 
> One of the former mails mentioned that a 20M mail should use ~700M of RAM.
> 6M tokens eating about 4G of RAM means ~750B per token, is that fine?

I'm pretty sure the Bayes code does many dumb things with the tokens, which
results in excessive memory usage for abnormal cases like this.
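
For a rough sense of scale, even a bare token -> (spam count, ham count,
atime) mapping costs a few hundred bytes per entry in a dynamic language.
A small Python sketch (an analogy only, with assumed structure; the actual
Bayes code is Perl and keeps more state per token than this):

  import sys

  # Rough illustration only: each Bayes-style token maps to a small record of
  # counts plus an access time, and every object carries per-object overhead.
  tokens = {f"tok{i:06d}": [1, 0, 1576057984] for i in range(100_000)}

  key = next(iter(tokens))
  per_key   = sys.getsizeof(key)                    # the token string itself
  per_value = sys.getsizeof(tokens[key]) + 3 * 28   # list plus three int objects (approx.)
  per_slot  = sys.getsizeof(tokens) / len(tokens)   # amortised dict slot
  print(f"~{per_key + per_value + per_slot:.0f} bytes per token")

Even that bare mapping comes out at a few hundred bytes per token before any
extra copies are made, so ~750B per token in Perl, with the intermediate
structures the tokenizer builds, is not all that surprising.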

> >This will be mitigated in 3.4.3, since it will only use max 50k of the body
> >text (body_part_scan_size).
> 
> will it prefer text parts and try to avoid uuencoded or base64 parts?
> (or maybe decode them?)

There is no change in how parts are processed.  As before, "body" is the
concatenated result of all textual parts.  But in 3.4.3 at least each part is
truncated to 50k.  If there are several parts then it's 50k+50k etc.
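
A minimal Python sketch of what that means in practice (illustration only,
with assumed names; the real implementation is Perl and the limit is the
body_part_scan_size setting):

  BODY_PART_SCAN_SIZE = 50 * 1024  # 50k cap per textual part

  def rendered_body(text_parts):
      """Concatenate all textual parts, truncating each one to the cap."""
      return "".join(part[:BODY_PART_SCAN_SIZE] for part in text_parts)

  # Two textual parts contribute at most 50k + 50k to the scanned "body".
  parts = ["A" * 200_000, "B" * 80_000]
  print(len(rendered_body(parts)))   # 102400

So the uuencoded text/* blob is still included, it's just capped at 50k like
any other textual part.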
