Hi Matt

Thanks for your detailed explanation of how signatures get stored and
interpreted.

I was looking through the code in libclamav to see what data formats get used
for string comparison. Some backtracking from cli_bm_scanbuff took me to str.c,
where I see there is a function, "cli_hex2str", which, if I understand
correctly, maps two hex digits to one character (unsigned char). Would it be
fair to speculate that this function is used by the ClamAV engine to map two
hex digits read from a signature or scanned file into one char for
string-matching purposes?
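
A minimal sketch of the mapping being asked about, written for illustration
only. It is not the actual libclamav code: decode_hex_pattern and find_bytes
are hypothetical names, and the naive search below merely stands in for the
real Boyer-Moore matcher. It assumes a fixed-string pattern made only of hex
digits (no wildcards).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Value of one ASCII hex digit, or -1 if c is not a hex digit. */
int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Hypothetical helper (NOT libclamav's cli_hex2str): decode a string of
 * ASCII hex digits into raw bytes, two digits per output byte.
 * Returns a malloc'd buffer and stores its length in *out_len, or returns
 * NULL on empty/odd-length input, a non-hex character, or allocation failure.
 */
unsigned char *decode_hex_pattern(const char *hex, size_t *out_len)
{
    assert(hex != NULL && out_len != NULL);

    size_t n = strlen(hex);
    if (n == 0 || n % 2 != 0) return NULL;

    unsigned char *out = malloc(n / 2);
    if (out == NULL) return NULL;

    for (size_t i = 0; i < n; i += 2) {
        int hi = hex_digit_value(hex[i]);
        int lo = hex_digit_value(hex[i + 1]);
        if (hi < 0 || lo < 0) {
            free(out);
            return NULL;
        }
        /* Two hex digits become one byte: high nibble, then low nibble. */
        out[i / 2] = (unsigned char)((hi << 4) | lo);
    }
    *out_len = n / 2;
    return out;
}

/*
 * Naive byte-wise search standing in for the real matcher.
 * Returns the offset of the first match, or -1 if there is none.
 */
long find_bytes(const unsigned char *buf, size_t buf_len,
                const unsigned char *pat, size_t pat_len)
{
    if (pat_len == 0 || pat_len > buf_len) return -1;
    for (size_t i = 0; i + pat_len <= buf_len; i++) {
        if (memcmp(buf + i, pat, pat_len) == 0) return (long)i;
    }
    return -1;
}
```

Under this sketch, the toy signature "fe" decodes to the single byte 0xFE
rather than to the two characters "f" and "e", and "41" decodes to 0x41, which
then matches an "A" in a text file and an "inc ecx" opcode in an executable
alike. The scanned buffer itself is compared as raw bytes and is not decoded.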

Thank you.


On Sat, Mar 23, 2013 at 11:02 AM, Matt Olney <mol...@sourcefire.com> wrote:

> Well... data is data.  There is no difference (from a storage perspective)
> between an executable with an "inc ecx" instruction and a text document with
> an "A".  Both are represented by the value 0x41.  So from Clam's perspective,
> a signature matching a single "A" would be identical to a signature that
> detects a single "inc ecx" instruction.  Both would look for 41.
>
> In short, your statement "some files are hex and some are character-based"
> isn't really accurate.  At the risk of painting with a broad brush, I would
> say that all files are stored as a series of values, a series of bytes.
> How you display them is what differs.  When I use 010 Editor to view a file
> as hex, I get a set of ASCII hex representations.  When I look at a file
> with a web browser, I get ASCII text.  But underlying all of that is the
> same idea: a set of bytes.  And that is how ClamAV treats all files.
>
> A signature with a 41 in it would be converted in memory to look for a
> single byte of value 0x41.  A signature written like that would match an
> executable, a PDF, a Flash file, or anything else that has 0x41 in its data.
>
> Hope that answers your question.
>
> Matt
>
>
> On Fri, Mar 22, 2013 at 8:46 PM, Kaushik Vaidyanathan <
> kvaid...@andrew.cmu.edu> wrote:
>
> > Hi
> >
> > I have a basic question. Most body-based signatures are hex-based (let's
> > focus on fixed-string signatures alone for simplicity), whereas some of
> > the files are hex (EXE) or character-based (HTML).
> >
> > In the code I see unsigned chars used predominantly to represent patterns
> > and file contents. I would like to understand how the data types get
> > manipulated at the very core of the string-matching algorithms, mainly
> > extended Boyer-Moore.
> >
> > 1) Do the character-based files get translated to hex to compare with
> > body-based signatures?
> >
> > 2) Does the signature get treated as a string of chars?
> > If yes,
> > does a toy signature "fe" get treated as two chars (8 bits each) for "f"
> > and "e", or
> > does the code read the signature "fe" and map it into one character based
> > on the ASCII table (for example)?
> >
> > Thank you..
> > _______________________________________________
> > http://lurker.clamav.net/list/clamav-devel.html
> > Please submit your patches to our Bugzilla: http://bugs.clamav.net
