Bug Tracker item #2796390, was opened at 2009-05-25 13:33 Message generated for change (Comment added) made by vadz You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=1126467&aid=2796390&group_id=250683
Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: None Group: None Status: Open Resolution: None Priority: 5 Private: No Submitted By: Vadim Zeitlin (vadz) Assigned to: Nobody/Anonymous (nobody) Summary: poor performance of tokenization functions Initial Comment: At least the functions _ds_decode_quoted() and _ds_degenerate_message() have O(N^2) asymptotic performance (where N is the length of the input message) as they iterate over a string and use memmove() inside the loop. In practice this means that processing a big but not huge (~2MB) base64-encoded message with a lot of HTML fragments inside it takes ages (~an hour) on a reasonably fast modern machine which is unacceptable. ---------------------------------------------------------------------- >Comment By: Vadim Zeitlin (vadz) Date: 2009-05-25 22:20 Message: Unfortunately the message is work-related and so I can't attach it here but I don't think there is anything special about it, it's just QP-encoded (sorry for base64 confusion!) and also has plenty of HTML fragments, although from rough "profiling" by breaking into the process with gdb and looking at what it was doing it seems that it's the QP-decoding which took most of the time. As for how to fix it: it seems obvious to me that decoding should be done on the fly, i.e. by copying source to destination and replacing =XX with their decoded values. This would make the algorithm O(N) and should completely solve the problem -- I really have no idea why is _ds_decode_quoted() doing the decoding in place. Another micro optimization would be to not use strtol() but this is really negligible compared to O(N^2) complexity. As for doing it myself, I did think about it and will try to do it. What bothers me is lack of unit tests for these functions which would allow me to check that I didn't break anything when modifying them. So I guess I'd need to add them too and this would require more time than I currently have -- but I'll see if I can find it. ---------------------------------------------------------------------- Comment By: Stevan Bajic (sbajic) Date: 2009-05-25 21:32 Message: Could you attach such an message (not huge but ~2MB) triggering the long decode time? Do you have any recommendation how to enhance the decoding? Or maybe you have a patch for speeding up the decoding? I have tested other code doing quoted printable decoding and most of them are slower then the one found in DSPAM. If you have any reference how the decoding can be done faster then let me know. btw: _ds_decode_quoted() is dealing with quoted-printable and not with base64 encoding. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=1126467&aid=2796390&group_id=250683 ------------------------------------------------------------------------------ Register Now for Creativity and Technology (CaT), June 3rd, NYC. CaT is a gathering of tech-side developers & brand creativity professionals. Meet the minds behind Google Creative Lab, Visual Complexity, Processing, & iPhoneDevCamp asthey present alongside digital heavyweights like Barbarian Group, R/GA, & Big Spaceship. http://www.creativitycat.com _______________________________________________ Dspam-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/dspam-devel
