On [08/23/03 02:00], Tomasz Kojm wrote:
> On Fri, 22 Aug 2003 13:22:10 -0400
> Yevgeniy Miretskiy <[EMAIL PROTECTED]> wrote:
> 
> > While this memory usage is about 2x of usage on Linux, it's still
> > acceptible to me.
> 
> But it should not be acceptable to miss the polymorphic viruses I
> mentioned.

See below.

> 
> Yevgeniy, please don't offend but your observations are obvious to me.
> I analyzed this algorithm very deeply and even had few lectures on it. I
> can send you some papers/slides (unfortunately only in polish) if you
> want. 

Thanks for the offer.  I found quite a lot of material on the web, and
studied & discussed the algorithm in detail with other people.  

> Because the algorithm is data-dependend (which are very random in
> out case) 

Let's clarify something: by data-dependent do you mean that it depends
on input?  or do you mean that it depends on the database?

If you're saying that the algorithm memory usage/speed
is input dependent, then I think you're wrong.  In a case of
direct Aho-corassick implementation, it's always linear.  In clamav case,
it's a bit worse.  How much worse depends on the number of patterns
hanging in a linked list off of trie node.

If you're saying that the algorithm memory usage/speed depends on 
the database, then I would agree with you.
But, by saying that the data (i.e. the database) is very random,
are you stating something that's a fact, or is something that you
think is a fact? Define random please.  
The way I define it, is that if the data was indeed random, 
then every possible prefix of a pattern would be equally likelly to appear.  
This would mean, that a 2 character prefix should yeald 65K differnt patterns.  
Is you database like this?

Your input is random, but the database is not.  The database is finite in
size, and is constant. 

> there's no good estimator for a speed of memory growth. 

I think there are pretty good estimators for both.  You can estimate
the speed of algorithm as linear to input size (not counting loading/unloading
of the trie).  You can also estimate memory usage _very_ accuratly given
input pattern database.

> So the  word _BRUTAL_ should rather be swapped with _UNPREDICTABLE_ (of course
> the upper limit of memory usage may be calculated). This is not
> acceptable, because some software eg. qmail-scanner uses a hardcoded
> softlimit value which stops clamscan if it exceeds the limit. Please
> check qmail-scanner archives and I hope you will realize the situation.

How exactly is your memory usage is unpredictable?  I really don't follow
your logic.  Not only is it predictable, it can be calculated _EXACTLY_
even before running clamav. Or, if you don't feel like calculating, just
run it, and watch top.  Clamav memory usage should _never_ go up after
db is loaded and first buffer is read.

> 
> ClamAV 0.2x had CL_MIN_LENGTH hardcoded to 5 and there were critical
> problems(hangups, etc.) after a big database update.

Well, I guess 0.6 improved quite a bit as compared to 0.2.  
We had absolutelly _no_ problems reloading 5 level trie with 80K 
patterns while running in kernel mode, which is probably a bit
more unforgiving then userlevel program.  We went as far
as 8 levels, just to test things, again, w/out any problems.

> 
> > As you can see, the increase is not nearly exponential. As a matter of
> > fact, as you add more levels, you're increase your memory usage by
> > smaller, and smaller percentage as compared to previous level
> > increase.
> 
> Do you really want an anti-virys software which consumes 50 MB of your
> system's memory ?

Why NOT?  I have 1 process that consumes 50MB.  Every modern OS supports
copy on write.  I don't have to fork off 50MB for each scanner instance.
Since root trie is readonly, this 50 MB is loaded ONCE.
I'm sorry, but as I stated before, I think it should be left up to the
user to deside how much ram can be allocated to clamav.
Beside, if I run dedicated scanner servers with 2 GB of ram, why the hell shouldn't 
I use that ram for scanning?

> 
> > > 3) higher levels brake some polymorphic signatures (eg.
> > > W32/Magistr.B,   W32/Hybris.C), because we don't realize regular
> > > expressions in the   trie
> > 
> > Please elaborate. I think poly viruses (even with short subpatterns)
> > will be detected just fine.
> 
> No, they won't. For speed (and simplicity) reason we do not realize
> regular expression matching in the trie. Consider the signature for
> W32/Hybris.C:
> 
> 4000??????????????????????????83??????75f2e9????ffff00000000
> 
> Actually (with CL_MIN_LENGTH = 2) only the first two bytes (4000) are
> inside of the trie (the rest is being kept in a linked list in a
> corresponding node). You should guess what will the increase of the
> height do.

I'm sorry, but this makes not sense to me.  
First 2 characters (4000) will be used to locate some node on the second
level of the trie.  Then entire pattern will be added to that nodes linked
list. The matching will continue the same way whether it's a 2, or 5 level 
trie.  Very simply, the nodes that contain pattern linked lists are marked
with is_last=1 (the name should probably change).

Why don't you try running the patched clamav with 5 (or however many) levels  
on Hybris.C virus and see if it detects it.  I just did -- detected it
just fine.

> 
> > Now this is where you are not entire correct.  Running scanners with
> > random database is a nice benchmark of WORST case scenario. However, 
> > virus database is all but random -- you do have pattern prefixes
> > that occur more often the others. 
> 
> We really _must_ consider the worst scenario. ClamAV is a "mission
> critical" application and we cannot depend on our expectations.
> 

I'm not sure if I made it clear: I never assumed that I cannot observe
certain pattern in input.  All I said was that the database is NOT
random (it might seem like it is, but it's really not because
it is _finite_).  So, having really large, really random database
is a good test case for WORST case memory usage/performance.  This has
nothing to do with input.


-- 
  Eugene Miretskiy <[EMAIL PROTECTED]>
  INVISION.COM, INC.  (631) 543-1000
  www.invision.net  /  www.longisland.com 

Attachment: pgp00000.pgp
Description: PGP signature

Reply via email to