On [08/19/03 23:14], Tomasz Kojm wrote:
> On Tue, 19 Aug 2003 15:10:53 -0400
> Yevgeniy Miretskiy <[EMAIL PROTECTED]> wrote:
> 
> > Is there an alpha (or even pre alpha) version of this implementation?
> 
> No, there isn't :(
> 

That's unfortunate -- I really thought I could get my hands on this code :(

Anyway, a couple of other people and I are currently working on a research
project (antivirus file systems for Linux) in which we use clamav to provide
kernel-based virus detection.

While experimenting with clamav, we found that its performance can be
significantly improved by increasing the number of levels in the search trie.
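
To make the idea concrete, here is a minimal sketch of a fixed-depth
prefix trie whose leaves hold linked lists of patterns. It is an
illustration only: the names (struct pattern, struct node, trie_add,
trie_scan) are hypothetical and are not clamav's actual structures, and
the scan simply restarts from the root at every offset rather than
reproducing clamav's matcher. It counts how many full pattern
comparisons a scan performs, which is the cost that a deeper trie
reduces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define NUM_CHILDS 256          /* one child slot per byte value */

/* Hypothetical types for illustration; not clamav's own structures. */
struct pattern {
    const unsigned char *bytes;
    size_t length;
    struct pattern *next;       /* next pattern sharing the same prefix */
};

struct node {
    struct node *child[NUM_CHILDS];
    struct pattern *list;       /* filled in only at the leaf level */
};

static struct node *node_new(void)
{
    return calloc(1, sizeof(struct node));
}

/* File the pattern under its first `depth` bytes.
 * Returns 0 on success, -1 if the pattern is too short or malloc fails. */
static int trie_add(struct node *root, struct pattern *p, size_t depth)
{
    struct node *pos = root;

    assert(root != NULL && p != NULL);
    if (p->length < depth)
        return -1;
    for (size_t i = 0; i < depth; i++) {
        unsigned char c = p->bytes[i];
        if (pos->child[c] == NULL && (pos->child[c] = node_new()) == NULL)
            return -1;
        pos = pos->child[c];
    }
    p->next = pos->list;        /* push onto the leaf's linked list */
    pos->list = p;
    return 0;
}

/* Naive scan: at each offset walk `depth` bytes into the trie, then
 * compare the buffer against every pattern in the leaf's list.
 * Returns the number of full comparisons; *matches gets the hit count. */
static size_t trie_scan(const struct node *root, size_t depth,
                        const unsigned char *buf, size_t len,
                        size_t *matches)
{
    size_t comparisons = 0;

    *matches = 0;
    for (size_t off = 0; off + depth <= len; off++) {
        const struct node *pos = root;
        for (size_t i = 0; i < depth && pos != NULL; i++)
            pos = pos->child[buf[off + i]];
        if (pos == NULL)
            continue;           /* prefix not in the trie: nothing to compare */
        for (const struct pattern *p = pos->list; p != NULL; p = p->next) {
            comparisons++;
            if (p->length <= len - off &&
                memcmp(buf + off, p->bytes, p->length) == 0)
                (*matches)++;
        }
    }
    return comparisons;
}
```

With the patterns "abcd", "abxy" and "abzz" and the buffer
"zzabqzzabxy", a depth of 2 costs 6 full comparisons while a depth of 3
costs 1, and both find the same single match.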

Below are the results of timing clamscan, scanning a 2 gigabyte
VMware virtual disk which does not contain any viruses.
During our benchmarking, we ran similar tests many times with roughly
the same results.  The table below shows the timings from one such run.
Currently, clamav uses a 2 level trie.

Time | Level 2   |  Level 3  | Level 4   | Level 5
-----------------------------------------------------
real | 3m56.477s | 1m47.712s | 1m40.420s | 1m31.998s
user | 3m19.270s | 1m18.230s | 1m7.070s  | 1m0.020s
sys  | 0m8.770s  | 0m6.400s  | 0m8.710s  | 0m7.090s

Memory usage increases by roughly 5-7 MB per level.
Level 5 memory usage is around 25 MB.
Considering that most people use clamd, I think 25 MB of memory
for one process with roughly 3X the performance is a fair tradeoff.
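
For reference, the speedup figures implied by the table can be worked
out with a few lines of C. This is only arithmetic on the timings
quoted above; the helper names are made up for the example.

```c
/* Convert a time(1)-style "XmY.YYYs" reading to seconds. */
static double to_seconds(int minutes, double seconds)
{
    return minutes * 60.0 + seconds;
}

/* How many times faster `improved` is than `baseline`. */
static double speedup(double baseline, double improved)
{
    return baseline / improved;
}
```

For level 5 against level 2, speedup(to_seconds(3, 56.477),
to_seconds(1, 31.998)) gives about 2.6 for real time, and
speedup(to_seconds(3, 19.270), to_seconds(1, 0.020)) gives about 3.3
for user time.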

We found that going beyond level 5 does not buy you anything
in terms of speed -- only increased memory usage.
You can use the attached dbstats.pl program to get an idea
of what the optimal level might be (run the program,
giving it a list of virus database files on the command line).

Clamav's scanning speed improves with additional trie levels
because both the average and the maximum length of the pattern
linked lists decrease.
We are looking for a level where the average linked list length is
close to 1 and the maximum linked list length is relatively small.
For example, the output below shows the difference between
tries with 2 and 3 levels:

Level 2:
        Unique Prefixes: 3529
        Avg Linked List Length: 2.23
        Avg Linked List Length Descrease: 92.82%
        Min Linked List Length: 1.00
        Max Linked List Length: 240.00
        Memory Usage: 948.29 KB
        Memory Usage Increase: 27.55%
        Number of cl_node structures: 3945
Level 3:
        Unique Prefixes: 5080
        Avg Linked List Length: 1.55
        Avg Linked List Length Descrease: 30.49%
        Min Linked List Length: 1.00
        Max Linked List Length: 238.00
        Memory Usage: 4728.38 KB
        Memory Usage Increase: 79.94%
        Number of cl_node structures: 9225

Percentages above indicate the change relative to the previous level.
The dbstats.pl program terminates when the average linked list length reaches 1.
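
The list-length figures above can be approximated by grouping
signatures on their first N bytes. The sketch below is not the attached
dbstats.pl; it is a hypothetical C version of the same idea, and it
assumes that signatures shorter than the chosen depth are simply
skipped. The names struct sig, struct prefix_stats and
compute_prefix_stats are invented for the example.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical signature record: raw bytes plus their length. */
struct sig {
    const unsigned char *bytes;
    size_t length;
};

struct prefix_stats {
    size_t unique_prefixes;     /* number of distinct depth-byte prefixes */
    size_t min_list;            /* shortest per-prefix list */
    size_t max_list;            /* longest per-prefix list */
    double avg_list;            /* signatures per prefix on average */
};

static size_t g_depth;          /* prefix length used by the comparator */

static int cmp_prefix(const void *a, const void *b)
{
    const struct sig *x = *(const struct sig *const *)a;
    const struct sig *y = *(const struct sig *const *)b;
    return memcmp(x->bytes, y->bytes, g_depth);
}

/* Returns 0 on success, -1 on allocation failure or if no signature
 * is at least `depth` bytes long. */
static int compute_prefix_stats(const struct sig *sigs, size_t n,
                                size_t depth, struct prefix_stats *out)
{
    const struct sig **v = malloc(n * sizeof *v);
    size_t used = 0;

    if (v == NULL)
        return -1;
    for (size_t i = 0; i < n; i++)      /* assumption: skip short ones */
        if (sigs[i].length >= depth)
            v[used++] = &sigs[i];
    if (used == 0) {
        free(v);
        return -1;
    }

    /* Sort by prefix so that equal prefixes become adjacent runs. */
    g_depth = depth;
    qsort(v, used, sizeof *v, cmp_prefix);

    out->unique_prefixes = 0;
    out->min_list = used;
    out->max_list = 0;
    for (size_t i = 0; i < used; ) {
        size_t j = i + 1;
        while (j < used && memcmp(v[i]->bytes, v[j]->bytes, depth) == 0)
            j++;
        size_t run = j - i;             /* length of this prefix's list */
        if (run < out->min_list) out->min_list = run;
        if (run > out->max_list) out->max_list = run;
        out->unique_prefixes++;
        i = j;
    }
    out->avg_list = (double)used / (double)out->unique_prefixes;
    free(v);
    return 0;
}
```

Calling compute_prefix_stats for depths 2, 3, 4, ... and stopping once
avg_list reaches 1 mirrors the termination rule described above.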

We also ran benchmarks with a much larger virus database
(we made up about 80000 additional "signatures") and, not surprisingly,
found that the speed improvements with more signatures are even more significant.

Also attached is a patch against clamav-0.60 which adds a
configure option (--with-depth) to change the trie depth.

Let me know how it goes -- hope you find this info helpful.

-- 
  Eugene Miretskiy <[EMAIL PROTECTED]>
  INVISION.COM, INC.  (631) 543-1000
  www.invision.net  /  www.longisland.com 
Only in clamav-0.60.new: autom4te.cache
diff -ru -x aclocal.m4 -x Makefile.in -x configure clamav-0.60/configure.in clamav-0.60.new/configure.in
--- clamav-0.60/configure.in    2003-06-20 23:05:32.000000000 -0400
+++ clamav-0.60.new/configure.in        2003-08-20 17:11:49.000000000 -0400
@@ -176,6 +176,13 @@
 AC_SUBST(CFGDIR)
 AC_DEFINE_UNQUOTED(CONFDIR,"$cfg_dir",)
 
+dnl search tree depth
+AC_ARG_WITH(depth, 
+[  --with-depth=number   number of levels in pattern search tree (default=2).],
+tree_depth=$withval, tree_depth=2)
+
+AC_DEFINE_UNQUOTED(CL_MIN_LENGTH,$tree_depth,)
+
 dnl Do not overwrite the current config file
 AM_CONDITIONAL(INSTALL_CONF, test ! -r "$cfg_dir/clamav.conf")
 
Only in clamav-0.60.new: configure.in.orig
Only in clamav-0.60.new: configure.in.patch
diff -ru -x aclocal.m4 -x Makefile.in -x configure clamav-0.60/libclamav/clamav.h clamav-0.60.new/libclamav/clamav.h
--- clamav-0.60/libclamav/clamav.h      2000-03-15 20:05:00.000000000 -0500
+++ clamav-0.60.new/libclamav/clamav.h  2003-08-20 16:48:27.000000000 -0400
@@ -30,7 +30,10 @@
  
 
 #define CL_NUM_CHILDS 256
-#define CL_MIN_LENGTH 2
+
+#ifndef CL_MIN_LENGTH
+  #define CL_MIN_LENGTH 2
+#endif
 
 #define CL_COUNT_PRECISION 4096
 
diff -ru -x aclocal.m4 -x Makefile.in -x configure clamav-0.60/libclamav/matcher.c clamav-0.60.new/libclamav/matcher.c
--- clamav-0.60/libclamav/matcher.c     2000-01-09 17:15:00.000000000 -0500
+++ clamav-0.60.new/libclamav/matcher.c 2003-08-20 17:24:48.000000000 -0400
@@ -37,10 +37,6 @@
        struct cl_node *pos, *next;
        int i;
 
-    if(pattern->length < CL_MIN_LENGTH) {
-       return CL_EPATSHORT;
-    }
-
     pos = root;
 
     for(i = 0; i < CL_MIN_LENGTH; i++) {
@@ -177,7 +173,7 @@
 {
        struct cl_node *current;
        struct patt *pt;
-       int i, position, *partcnt;
+       int i, position, virfound, *partcnt;
 
     current = (struct cl_node *) root;
 
@@ -188,10 +184,11 @@
 
        if(current->islast) {
            position = i - CL_MIN_LENGTH + 1;
-
            pt = current->list;
+           virfound = (pt->length < CL_MIN_LENGTH);
+
            while(pt) {
-               if(cli_findpos(buffer, position, length, pt)) {
+               if(virfound || cli_findpos(buffer, position, length, pt)) {
                    if(pt->sigid) { /* it's a partial signature */
                        if(partcnt[pt->sigid] + 1 == pt->partno) {
                            if(++partcnt[pt->sigid] == pt->parts) { /* last */

Attachment: dbstats.pl
Description: Perl program
