[Dspam-devel] [ dspam-Bug Tracker-2796390 ] poor performance of tokenization functions

SourceForge.net Wed, 27 May 2009 12:08:46 -0700

Bug Tracker item #2796390, was opened at 2009-05-25 13:33
Message generated for change (Comment added) made by sbajic
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=1126467&aid=2796390&group_id=250683


Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Vadim Zeitlin (vadz)
Assigned to: Nobody/Anonymous (nobody)
Summary: poor performance of tokenization functions

Initial Comment:
At least the functions _ds_decode_quoted() and _ds_degenerate_message() have 
O(N^2) asymptotic performance (where N is the length of the input message) as 
they iterate over a string and use memmove() inside the loop.

In practice this means that processing a big but not huge (~2MB) base64-encoded 
message with a lot of HTML fragments inside it takes ages (~an hour) on a 
reasonably fast modern machine which is unacceptable.

----------------------------------------------------------------------

>Comment By: Stevan Bajic (sbajic)
Date: 2009-05-27 21:01

Message:
Hallo Vadim

I did more tests and compiled the new code with GCC switches I normally
use and the result surprised me: On the normal mail I tested before the new
code is faster then the currently implemented code. Where you able to test
the code I posted here? Does it solve one part of your problem?

Steve

----------------------------------------------------------------------

Comment By: Stevan Bajic (sbajic)
Date: 2009-05-27 01:17

Message:
Hallo Vadim

I made a quick test with a slightly modified decoding code then the one
posted here. I used a normal mail message with some parts in it encoded in
quoted printable and decoded it 1 million times without printing it to the
screen and this is the result:
new function: +/- 11 seconds
DSPAM original function: +/- 4 seconds

So the DSPAM function is way, way faster. But this is for 1'000'000 times
decoding a message. Okay. Now I took the whole README from 3.9.0 and
encoded it into quoted printable. I encoded each and every character. All
of them. This made the output +/- 3 times bigger in size and added 97004
encoded parts.

Now I went back to my test application and decoded it ONCE with the
version posted here (but slightly changed) and with the original DSPAM
version. The result is (for ONE message and ONE time decoding it and
printed it out):
new function: below 1 second (real    0m0.329s, user    0m0.010s, sys    
0m0.010s)
DSPAM original function: +/- 11 seconds (real    0m11.111s, user   
0m10.730s, sys     0m0.010s)


On normal mail stream with not much encoded stuff in it, the DSPAM
original code is the fastest. But as soon as you have a message with a lot
of encoded parts in it, then the DSPAM version is ultra bad. Very bad. And
my test was just with a small 300K message.

The same technique used in _ds_decode_quoted() for searching and replacing
is used in _ds_degenerate_message() for the HTML search/replace part. So I
assume the same effect as soon as you have a lot of HTML tags.

I think replacing the original _ds_decode_quoted() with the new one should
not be an issue for most users. It is slower for probably most of the users
but we are talking about a loss of 7 seconds per 1 million normal mails.
That's not much time (from my viewpoint).

What do you think?


----------------------------------------------------------------------

Comment By: Stevan Bajic (sbajic)
Date: 2009-05-25 22:59

Message:
Do you mean something like this here:

static int hex_to_bin(unsigned char c) {
  switch (c) {
    case '0': return 0;
    case '1': return 1;
    case '2': return 2;
    case '3': return 3;
    case '4': return 4;
    case '5': return 5;
    case '6': return 6;
    case '7': return 7;
    case '8': return 8;
    case '9': return 9;
    case 'a': case 'A': return 10;
    case 'b': case 'B': return 11;
    case 'c': case 'C': return 12;
    case 'd': case 'D': return 13;
    case 'e': case 'E': return 14;
    case 'f': case 'F': return 15;
    default: return -1;
  }
}

char *
_ds_decode_quoted (const char *body)
{
  if (!body)
    return NULL;

  char *out, hexbuf[3];
  int i, j;
  size_t b_pos = 0, o_pos = 0, len;

  hexbuf[2] = '\0';
  len = strlen(body) + 1;
  out = malloc (len);

  if (out == NULL) {
    // LOG (LOG_CRIT, ERR_MEM_ALLOC);
    return NULL;
  }

  for (b_pos = 0; b_pos < len; b_pos++) {
    if (body[b_pos] != '_' && body[b_pos] != '=') {
      hexbuf[0] = body[b_pos];
    } else {
      if (body[b_pos] == '=') {
        if (b_pos + 1 <= len && body[b_pos+1] == '\n') { /* RFC2045 */
          /* =\n -> skip them */
          b_pos++;
          continue;
        }
        if (b_pos + 2 <= len && body[b_pos+1] == '\r' && body[b_pos+2] ==
'\n') { /* RFC2045 */
          /* =\r\n -> skip them */
          b_pos += 2;
          continue;
        }
        if (b_pos + 2 <= len && (i = hex_to_bin(body[b_pos+1])) >= 0 && (j
= hex_to_bin(body[b_pos+2])) >= 0) {
          hexbuf[0] = (i << 4 | j);
          b_pos += 2;
        } else {
          hexbuf[0] = body[b_pos];
        }
      } else { /* RFC2047 - replace '_' with space */
        hexbuf[0] = ' ';
      }
    }
    out[o_pos] = hexbuf[0];
    o_pos++;
  }

  out[o_pos] = '\0';
  return out;
}



----------------------------------------------------------------------

Comment By: Vadim Zeitlin (vadz)
Date: 2009-05-25 22:20

Message:
Unfortunately the message is work-related and so I can't attach it here but
I don't think there is anything special about it, it's just QP-encoded
(sorry for base64 confusion!) and also has plenty of HTML fragments,
although from rough "profiling" by breaking into the process with gdb and
looking at what it was doing it seems that it's the QP-decoding which took
most of the time.

As for how to fix it: it seems obvious to me that decoding should be done
on the fly, i.e. by copying source to destination and replacing =XX with
their decoded values. This would make the algorithm O(N) and should
completely solve the problem -- I really have no idea why is
_ds_decode_quoted() doing the decoding in place.

Another micro optimization would be to not use strtol() but this is really
negligible compared to O(N^2) complexity.

As for doing it myself, I did think about it and will try to do it. What
bothers me is lack of unit tests for these functions which would allow me
to check that I didn't break anything when modifying them. So I guess I'd
need to add them too and this would require more time than I currently have
-- but I'll see if I can find it.

----------------------------------------------------------------------

Comment By: Stevan Bajic (sbajic)
Date: 2009-05-25 21:32

Message:
Could you attach such an message (not huge but ~2MB) triggering the long
decode time?
Do you have any recommendation how to enhance the decoding? Or maybe you
have a patch for speeding up the decoding?
I have tested other code doing quoted printable decoding and most of them
are slower then the one found in DSPAM. If you have any reference how the
decoding can be done faster then let me know.

btw: _ds_decode_quoted() is dealing with quoted-printable and not with
base64 encoding.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=1126467&aid=2796390&group_id=250683

------------------------------------------------------------------------------
Register Now for Creativity and Technology (CaT), June 3rd, NYC. CaT 
is a gathering of tech-side developers & brand creativity professionals. Meet
the minds behind Google Creative Lab, Visual Complexity, Processing, & 
iPhoneDevCamp as they present alongside digital heavyweights like Barbarian 
Group, R/GA, & Big Spaceship. http://p.sf.net/sfu/creativitycat-com 
_______________________________________________
Dspam-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspam-devel

[Dspam-devel] [ dspam-Bug Tracker-2796390 ] poor performance of tokenization functions

Reply via email to