[Dspam-devel] [ dspam-Bug Tracker-2796390 ] poor performance of tokenization functions

SourceForge.net Mon, 25 May 2009 13:25:26 -0700

Bug Tracker item #2796390, was opened at 2009-05-25 13:33
Message generated for change (Comment added) made by vadz
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=1126467&aid=2796390&group_id=250683


Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Vadim Zeitlin (vadz)
Assigned to: Nobody/Anonymous (nobody)
Summary: poor performance of tokenization functions

Initial Comment:
At least the functions _ds_decode_quoted() and _ds_degenerate_message() have 
O(N^2) asymptotic performance (where N is the length of the input message) as 
they iterate over a string and use memmove() inside the loop.

In practice this means that processing a big but not huge (~2MB) base64-encoded 
message with a lot of HTML fragments inside it takes ages (~an hour) on a 
reasonably fast modern machine which is unacceptable.

----------------------------------------------------------------------

>Comment By: Vadim Zeitlin (vadz)
Date: 2009-05-25 22:20

Message:
Unfortunately the message is work-related and so I can't attach it here but
I don't think there is anything special about it, it's just QP-encoded
(sorry for base64 confusion!) and also has plenty of HTML fragments,
although from rough "profiling" by breaking into the process with gdb and
looking at what it was doing it seems that it's the QP-decoding which took
most of the time.

As for how to fix it: it seems obvious to me that decoding should be done
on the fly, i.e. by copying source to destination and replacing =XX with
their decoded values. This would make the algorithm O(N) and should
completely solve the problem -- I really have no idea why is
_ds_decode_quoted() doing the decoding in place.

Another micro optimization would be to not use strtol() but this is really
negligible compared to O(N^2) complexity.

As for doing it myself, I did think about it and will try to do it. What
bothers me is lack of unit tests for these functions which would allow me
to check that I didn't break anything when modifying them. So I guess I'd
need to add them too and this would require more time than I currently have
-- but I'll see if I can find it.

----------------------------------------------------------------------

Comment By: Stevan Bajic (sbajic)
Date: 2009-05-25 21:32

Message:
Could you attach such an message (not huge but ~2MB) triggering the long
decode time?
Do you have any recommendation how to enhance the decoding? Or maybe you
have a patch for speeding up the decoding?
I have tested other code doing quoted printable decoding and most of them
are slower then the one found in DSPAM. If you have any reference how the
decoding can be done faster then let me know.

btw: _ds_decode_quoted() is dealing with quoted-printable and not with
base64 encoding.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=1126467&aid=2796390&group_id=250683

------------------------------------------------------------------------------
Register Now for Creativity and Technology (CaT), June 3rd, NYC. CaT
is a gathering of tech-side developers & brand creativity professionals. Meet
the minds behind Google Creative Lab, Visual Complexity, Processing, & 
iPhoneDevCamp asthey present alongside digital heavyweights like Barbarian
Group, R/GA, & Big Spaceship. http://www.creativitycat.com 
_______________________________________________
Dspam-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspam-devel

[Dspam-devel] [ dspam-Bug Tracker-2796390 ] poor performance of tokenization functions

Reply via email to