I've done some analysis of ClamAV with just this signature set, and loading
simply slows down as it works through the list. This is mainly because of
the large amount of overlap at the beginnings of these strings and their
length thereafter. The slowdown occurs even before the tries are built
(which happens after all signatures are loaded). It has to do with the way
the signatures are sorted and managed in an intermediate state between
loading the signature file and building the "final" scanning trie. I will
try to strike a balance between technical detail and TL;DR, but here are
some details.

First, some properties of the dataset in the current
bofhland_cracked_URL.ndb file:
89 thousand signatures (really ~44,500, each written twice so that they load
into both the HTML and MAIL targets)
53 thousand start with "www." (7777772E)
14 thousand start with "t.co" (742E636F)
Within the 44,500, each signature is a single unique pattern: no
subpatterns, no wildcards, nothing for ClamAV to break into pieces.
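
For anyone who wants to reproduce counts like these on their own database, here is a minimal Python sketch. It assumes the usual .ndb layout of Name:TargetType:Offset:HexSignature and takes the first 8 hex digits (4 bytes) of each signature as the prefix. The function names are made up for illustration and are not part of ClamAV.

```python
from collections import Counter

def hex_signature(line):
    """Return the hex signature field of one .ndb line, or None if absent."""
    fields = line.strip().split(":")
    # Assumed layout: Name:TargetType:Offset:HexSignature[:MinFL[:MaxFL]]
    if len(fields) < 4:
        return None
    return fields[3].upper()

def prefix_counts(lines, prefix_len=8):
    """Count how many signatures share each leading run of hex digits."""
    counts = Counter()
    for line in lines:
        sig = hex_signature(line)
        if sig is not None:
            counts[sig[:prefix_len]] += 1
    return counts
```

Passing an open file object for the .ndb file as `lines` and calling `.most_common(5)` on the result should list the heaviest shared prefixes, such as 7777772E for "www.".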

Why some wildcard replacement works to gain load speed:
The reason that replacing certain bytes with wildcards works is that
ClamAV can treat the wildcard as a subpattern breakpoint. Even better,
when the initial subpattern is an exact repeat, the 100%-matching overlaps
can be handled differently by the code. The best load-time bang for the
buck I got was replacing "7777772E" with "777777{1}".
End-to-end clamscan runtime in a VM before the simple replacement:
  Time: 62.540 sec (1 m 2 s)
End-to-end clamscan runtime in a VM after the simple replacement:
  Time: 2.965 sec (0 m 2 s)
All I did was run a replace command in vim.
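
The same substitution can be scripted instead of done by hand in an editor. The Python sketch below is one way to do it, not the exact vim command used above. It assumes the .ndb layout Name:TargetType:Offset:HexSignature and only rewrites a "www." that sits at the very start of the hex signature. The function name is hypothetical.

```python
WWW_DOT = "7777772E"   # hex for "www."
WWW_ANY = "777777{1}"  # "www" followed by exactly one byte of any value

def relax_www_prefix(line):
    """Swap a leading 7777772E in the hex signature field for 777777{1}."""
    fields = line.rstrip("\n").split(":")
    # Assumed layout: Name:TargetType:Offset:HexSignature[:MinFL[:MaxFL]]
    if len(fields) >= 4 and fields[3].upper().startswith(WWW_DOT):
        fields[3] = WWW_ANY + fields[3][len(WWW_DOT):]
    return ":".join(fields)
```

Applying this to every line and writing the result to a new file leaves the original database untouched, so the two versions can be compared side by side.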

The trade-off, because everything has a cost:
(1) Mildly less accurate, since the dot has been replaced with a wildcard
that matches any single byte. But with strings that are all this long, the
signatures should still be very specific.
(2) More subpatterns equals more memory:
Original report --- LibClamAV debug: pool memory used: 36.734 MB
After replacement --- LibClamAV debug: pool memory used: 48.855 MB
With more subpatterns to track, the extra tracking comes with a price, and
that price is in memory: about 12 MB here, or roughly a third more.

I'll look a bit more at how we are loading the interim signature state and
see what else we could do with the sorting. Meanwhile, this is a change you
could put into practice now and get faster startup times. Before making any
change on a server directly, you can test a modified DB with clamscan to
see the difference.
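
A before/after comparison along these lines can be scripted. The Python sketch below is only an illustration: it assumes clamscan is on the PATH, that -d loads the virus database from the given file, and that --debug prints the "pool memory used" line quoted above. The helper names are invented for this example.

```python
import re
import subprocess
import time

# Matches debug lines such as "LibClamAV debug: pool memory used: 36.734 MB"
POOL_RE = re.compile(r"pool memory used: ([0-9.]+) MB")

def clamscan_command(db_path, target):
    """Build a clamscan invocation that loads the database from db_path."""
    return ["clamscan", "--debug", "-d", db_path, target]

def pool_memory_mb(output):
    """Extract every reported pool memory figure (in MB) from debug output."""
    return [float(m) for m in POOL_RE.findall(output)]

def timed_scan(db_path, target):
    """Run clamscan once; return (elapsed seconds, pool memory figures)."""
    start = time.perf_counter()
    proc = subprocess.run(
        clamscan_command(db_path, target),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # debug messages may go to stderr
        text=True,
    )
    elapsed = time.perf_counter() - start
    return elapsed, pool_memory_mb(proc.stdout)
```

Calling `timed_scan` once with the original .ndb and once with the modified copy, against the same small test file, gives the two numbers to compare.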

My testing VM is 64-bit Debian, if it matters.

Hope this helps,

Dave R.



On Wed, Aug 14, 2013 at 12:40 PM, Matt Olney <mol...@sourcefire.com> wrote:

> OK, we've been able to reproduce the problem and it is, as you all
> suspected revolving around the www. matching.  I've asked one of the
> developers to look at it, and we should be able to provide some
> best-practice guidelines on how to construct rules to avoid this situation.
>  We'll also review if code changes are appropriate, but given how the tree
> operates, I don't immediately expect that to be the case.
>
> Matt
> _______________________________________________
> Help us build a comprehensive ClamAV guide: visit http://wiki.clamav.net
> http://www.clamav.net/support/ml
>



-- 
---
Dave Raynor
Sourcefire Vulnerability Research Team
dray...@sourcefire.com
